Project Management in Real Life

Sharing my Project Management adventures and some tips. I like to keep my articles brief and to the point. Project Management is an Art, Science, and Discipline. Just keep it simple and have fun!

The System Failure Assessment

It's a good idea to evaluate all your systems and rate, on a scale of 0 to 5, the level of risk you are exposed to. Develop a spreadsheet to track these systems. Items to list on the spreadsheet include hardware, operating system, database, applications, vendors, and any other information you feel is important to support the system. While you are doing the assessment, it's also a good time to update your Data Center floor plan diagram to show where each system resides. Once you complete the assessment of a system, place a sticker on the hardware indicating that it has been evaluated, and come up with a code to cross-reference it back to your floor plan diagram.
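If you would rather start the tracking sheet from a script than from a blank spreadsheet, here is a minimal sketch in Python. The column names, the `SystemRecord` class, the sticker code format, and the sample row are all my own illustrations, not a standard; adjust them to whatever you feel is important to support your systems.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class SystemRecord:
    """One row of the assessment spreadsheet (all field names are illustrative)."""
    floor_plan_code: str   # code on the hardware sticker, cross-referenced to the floor plan
    name: str
    hardware: str
    operating_system: str
    database: str
    applications: str
    vendor: str
    risk_level: int        # 0 (lowest) to 5 (highest)
    assessed: bool = False

    def __post_init__(self):
        # Reject grades outside the 0 to 5 scale used by the assessment.
        if not 0 <= self.risk_level <= 5:
            raise ValueError(f"risk_level must be 0-5, got {self.risk_level}")


def to_csv(records):
    """Render the records as CSV text that any spreadsheet tool can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(SystemRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Hypothetical sample entry; replace with your own systems.
records = [
    SystemRecord("DC1-R03-U12", "billing-db", "rack server", "Linux", "PostgreSQL",
                 "billing app", "Example Vendor", risk_level=2, assessed=True),
]
print(to_csv(records))
```

Save the printed text to a `.csv` file and open it in your spreadsheet tool. The `floor_plan_code` column is what ties each row back to the sticker on the hardware and to the floor plan diagram.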

Doing a system failure assessment is a big project that takes commitment. Depending on the size of your organization and how deep you dig for information, it could take months to complete. I suggest a thorough examination that will expose all vulnerabilities. The evaluation is also a good time to review all your support contracts, and it gives you a heads-up to budget for system upgrades or replacements. You will come away with a lot of information to present to management.

As an example of grading your findings, a 5, the highest level of risk, could be a black box that just runs and that no one really has a clue how to support, because the person who installed it left 10 years ago. It just keeps on running; don't ever reboot it. If you are running on an old hardware platform that is off support and you are unable to patch the operating system, that is also a level 5 risk. These are just a few examples to give you ideas on how to rank your risks.
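To keep the grading consistent from one system to the next, the rules can be written down as a small function. This is only a sketch: the two level 5 rules come from the examples above, while the lower grades, the function names, and the "level 3 and up needs action" cutoff are placeholder assumptions for you to replace with your own criteria.

```python
def grade_risk(nobody_can_support, hardware_off_support, os_unpatchable):
    """Return a 0-5 risk level.

    Only the two level 5 rules come from the article's examples; the
    lower grades are placeholder assumptions to tune for your organization.
    """
    # From the article: a black box that no one knows how to support is a 5.
    if nobody_can_support:
        return 5
    # From the article: off-support hardware with an unpatchable OS is a 5.
    if hardware_off_support and os_unpatchable:
        return 5
    # Assumption: only one of those two conditions rates a 3.
    if hardware_off_support or os_unpatchable:
        return 3
    # Assumption: nothing flagged rates a 0.
    return 0


def needs_action(risk_level):
    """Assumption: levels 3 through 5 are the ones to act on."""
    return risk_level >= 3


print(grade_risk(nobody_can_support=True, hardware_off_support=False, os_unpatchable=False))
print(needs_action(grade_risk(False, True, False)))
```

Writing the rules down this way also makes them easy to review with management: anyone can read the function and disagree with a specific grade.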

The System Failure Assessment needs to be a living document, updated periodically to keep it effective and valuable. It will serve as a disaster avoidance / disaster recovery document. It's your decision how to act on level 3 - 5 risks. Don't wait; act now.

 

(Note - this article was originally written by Drake Settsu and published on DrakeSettsu.BlogSpot.com in February 2015)

Posted on: April 24, 2018 08:48 AM | Permalink | Comments (10)

The System Crashed Hard

The Disaster 

Multiple hard disks crashed in a vendor-managed server that supports the State of Hawaii’s driver's licensing programs. The storage media is encrypted and secured, and some of the data is not readable. There was no security or data breach. It will take a couple of months to determine if the data can be recovered.

The Vendor

The spokesperson gave a briefing at a press conference that sounded a lot like "the dog ate my homework".

1) There was a backup system in place that was supposed to protect the data when hard drives crash, but it was not properly configured.

2) They were not aware that certain documents or images were not getting backed up properly.

3) The backups are said to be checked to make sure everything is working properly, but no details were given on how the data was checked or what the plan for checking it was.

Summary

When implementing a new system, I have a few suggestions on how to make your system bulletproof. I used to be a Systems Administrator and I never lost any data on my watch.

1) The Project Manager in charge needs good Subject Matter Experts to recommend the technology, security, and procedures that must be in place so the system has the redundancy to withstand a disaster and prevent a security breach.

2) The Statement of Work needs to always include clear expectations.

3) Service Level Agreements need to be in place for the system.

4) Have a Disaster Avoidance plan and Recovery Strategies in place that meet the Service Level Agreements.

5) Test your systems periodically to ensure that the data is being replicated or backed up properly to media that will go offsite.

6) Make sure you have a good hardware / software support contract in place.

7) Never ever trust the vendor. Make a checklist of your key deliverables to be reviewed and demonstrated when you begin the sign-off stage in the project. Any missing checks will result in a big missing check for that vendor to cash.
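Suggestion 5 is the one that would have caught the problem in this story, so here is a rough sketch of what a periodic backup check could look like in Python. It assumes you have restored the backup to a directory of its own and that the source files have not changed since the backup ran; the function names and the directory layout are hypothetical, and a real backup product will have its own verification tools that you should use first.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return (relative path, problem) pairs for files missing from or different in the backup."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            # The file never made it into the backup at all.
            problems.append((str(relative), "missing"))
        elif sha256_of(source_file) != sha256_of(backup_file):
            # The file is there, but its contents do not match the source.
            problems.append((str(relative), "mismatch"))
    return problems
```

Call `verify_backup` with the live data directory and the restored copy, and treat any non-empty result as a failed check. The "missing" entries are exactly the kind of documents or images that were quietly not getting backed up.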

Posted on: March 27, 2018 10:18 AM | Permalink | Comments (9)

Your Database is toast, get out the butter and jelly

It's Aloha Friday morning, I'm on my way to work, and I get a call from the Data Center that the organization's mission-critical system is acting strange. Users are unable to access the system. Moments later I get a text page from the system that a process that needs to run is gone. I had scripts monitoring the system, so I knew something bad had happened.

I log on to the system and it's not behaving normally; the process that needs to be running is missing. I speak with the Data Center staff and determine that an operator error caused the problem. The production database was an interim version, in place until the migration to a new permanent database later in the month. It had known bugs, and we had just stepped on one by accident.

I call an emergency meeting with management to brief them that the database has been corrupted. I contact the application vendor to take a look at the database integrity. I advise management to implement downtime procedures. Hours elapse as we work to fix the database, with no luck. The database cannot be repaired. My only option now is the last system backup. I ask the Data Center Manager to bring the backup tapes to me. I’m holding in my hands the tapes of the organization's mission-critical system. I need a successful restore to minimize the window of data loss.

I call another meeting to break the bad news to management and staff. It’s going to be a long night. We need to inform our users that the system will be down for an extended time and that we will be giving periodic updates on the recovery progress. Key personnel need to be on standby and available when called.

The restore process is initiated. Restoring a large system takes hours, and patience while waiting for it to complete. Once the restore completes, we need to run the integrity checker utilities to make sure no corruption exists on the restored system. The system is validated for release back to the users, with some minimal data loss between the backup and the time of the database corruption. After some minor tweaks, it's back online for everyone to use.

I got into the office at 8 a.m. Friday and it’s now 8 a.m. Saturday. I just experienced a System Administrator's nightmare. When you are coordinating a major Disaster Recovery effort, it's all about teamwork. You need to remain calm as you become the point person who knows everything that's going on. All eyes are on you. My Project Management experience definitely came into play on this recovery effort, which had to be put together on the fly. You really find out what you are made of when you go through an experience like a major system outage. It's a really good feeling to see everyone work together as a team.

(Note - this article was originally written by Drake Settsu and published on DrakeSettsu.BlogSpot.com in February 2015)

 

* This incident happened a long time ago and I will never forget it. Technology has come a long way since that disaster. Today's technology can provide a faster recovery from a major disaster. 

Posted on: January 31, 2018 06:21 AM | Permalink | Comments (10)