The Art of Being a Successful DBA – Poka-Yoke and Paranoid DBA Best Practices

Ever look at a screen’s output and get that puckered feeling in the pit of your stomach? If you have been working in this profession for any amount of time, you know the feeling I’m talking about. The feeling that makes you think you would rather be living in Montana making woodcarvings at a roadside stand than being a DBA. I’ll be taking a somewhat lighthearted look at the perils of our profession and discuss ways to reduce problem occurrences.

The Perils of our Profession
One of the common challenges that all DBAs face, no matter what vendor’s database they work on, is the absolute attention to detail our profession demands. Switch a couple of characters in a script, forget to set your SID, set the wrong flag at the wrong time and the end result usually isn’t very pretty. Many commands we issue on a regular basis are destructive by their very nature. This is the reason why I have a great respect for all technicians who have selected database administration as their chosen profession.

I know they have all experienced that uncontrolled “eye-twitching” at 2 AM when they are ready to hit the final enter key to execute the command. You know what command I’m talking about too. It’s that one command that you really, really, really hope is going to come back with a successful return code and ultimately end with a database that is finally usable. Whether it’s a recovery, a file fix or corrupt data is immaterial, it’s the wait that we are talking about.

There is no longer wait in our profession than waiting for the message below after a database recovery:

SQL> Database opened.

Time always seems to stand still. The longer the recovery, the messier the recovery.  The more critical the database – the longer you wait. You stare at the screen hoping beyond hope that the above message will appear. It’s the ritual cross your fingers, spin around three times, face towards Oracle headquarters and pray to everything that is Larry Ellison wait. I’ve actually caught myself mumbling, “Come on, come on, come on…” I don’t care how sure you are of your capabilities, or how much of an Oracle “Ace” you are – you know the anticipation I’m talking about.

You then either breathe a sigh of relief or you are in absolute disgust when you see an Oracle error message appear. How about the old “File 1 needs more recovery to be consistent” or the “File 2 not restored from a sufficiently old backup”? Those messages are enough to make anyone cringe. I’m an ex-Oracle instructor. I’ve seen those messages A LOT in class. I still cringe.

At a previous job, I once had to run through 36 hours of tapes to restore a multi-terabyte warehouse. A disaster occurred that required us to do a recovery. THAT was the longest wait for a database open message I ever experienced. One of my fellow DBAs asked if I needed a brown paper bag to breathe into.

Or it’s the command that drops the schema in the test environment that will allow you to do a refresh from production. It’s that test database that runs on the same box as production. The environment that makes you do a “SELECT name FROM V$DATABASE” command in SQL*PLUS about 15 times in a row before you finally execute the “DROP USER CASCADE” statement.

Not only must we try to prevent our own mistakes, we must safeguard our environments against the mistakes of others. Operating system administrators, disk storage technicians and application developers are just like us. We are all part of the human community that makes mistakes from time to time.

If you never make mistakes, send me a resume. I’m always looking for a “Patron Saint of Oracle” here at Remote DBA Experts. It will also save us on travel costs because I’m sure you’ll be able to spread your wings and fly here on your own.

But as my old boss Dan Pizzica used to tell me (when I was a VERY junior DBA) “It really doesn’t make a difference who broke the database. You are the technician who is ultimately responsible for fixing it. The buck stops with you. If you can’t protect your environments, you aren’t doing your job.” We all know he’s absolutely correct.

Then there’s the software glitches. The problems that pop up out of the blue and make you go:

“WHAT THE? – How did THAT happen? I’ve done this 317 times in a row and it worked every time.”

For you math majors, here’s my calculation for this:

CLOSER YOU ARE TO PRODUCTION TURNOVER
+ THE GREATER THE VISIBILITY OF THE PROJECT
= THE MORE LIKELY A PREVIOUSLY UNKNOWN SOFTWARE GLITCH WILL OCCUR

I don’t care what software you are using, you will run into the “only occurs on this release, on this version of the operating system, using this particular feature on the third Tuesday of the sixth month when it’s cloudy outside” BUG. Be sure to expect management to stop by and ask “well, why didn’t you test this on the third Tuesday of the sixth month when it was cloudy outside?”

The more complex the database ecosystem, the more paranoid I become. Which is why I’m not a follower of “the database is getting so easy – we won’t need DBAs” mantra that mindless industry pundits profess on a seemingly endless basis.

So now we know that our jobs are somewhat unforgiving and we do make a mistake from time to time. What can we do to reduce the chance of an error occurring?

Poka-Yoke for DBAs!
We recently had a Poka-Yoke contest here at Remote DBA Experts.   We hold contests on a regular basis to jumpstart the creative process on activities that we feel strongly about.   We are big proponents of Poka-Yoke”.  Poka-Yoke is a Japanese term that means “fail-safing” or “mistake- proofing”.   Wikipedia’s definition of Poka-Yoke is: “its purpose is to eliminate product defects by preventing, correcting or drawing attention to human errors as they occur.”

Since I’m a car nut, here’s a couple of automotive Poka-Yoke examples.    You can’t take the keys out of most modern cars until the car is in park.  In addition, most cars won’t allow you to shift out of park until the key is in the “ON” position.   How about gas caps that have the little tether that prevents us from driving off without the cap?   Most gas caps are also attached using a ratchet assembly that ensures proper tightness and prevents over tightening.

Take a look around you, you’ll see dozens of Poka-Yokes during your daily activities:

  • The little holes in bathroom sinks that prevent overflows
  • Microwaves will stop when the door is opened
  • Same thing with dryer doors
  • Lawn movers that have a safety bar that must be depressed before they will run
  • Disk brakes that begin to make a noise before they are completely ground down
  • Rumble strips on roads

The list really is endless.  We have applied the Poka-Yoke process to our daily activities here at Remote DBA Experts.  We have checklists, process documentation, best practices, sign-off sheets – the works. One of the first blogs of this series is a discussion on the importance of good documentation.

I’d be very interested to learn your Poka-Yoke ideas!   If you have a Poka-Yoke idea, please respond and we’ll be glad to discuss it.  Here’s some general ones that I recommend.

The Second Set of Eyes
As I have stated in previous blogs, I have over 20 years of experience using Oracle and have done my fair share of database backups and recoveries. During my career as an Oracle instructor, I have assisted in hundreds of database recoveries in Oracle’s classroom environments. If possible, I still have others review my recovery strategy and recovery steps before I begin the recovery process. I used backup and recovery just as an example. Whatever the process is you are are performing, a second opinion may prevent you from making a mistake. A review from a fellow DBA has saved me more than once. I may be described as having an ego (I have no idea where they get that opinion) but it doesn’t prevent me from asking for help from others.

We were recently correcting a very poor backup script created by a customer’s previous database support vendor. The customer described this particular environment as “if it goes down we lose our ability to make money” application. After our massive set of changes were complete, two of us went line-by-line verifying each line of the backup script. At the end of each script we asked each other “are you OK with this”? Only then did we move on to the next one. I don’t care how much time you have “in the seat” using Oracle, you need to put your ego aside at times and have someone check your work on critical activities.

Concentration
I used to work for a shop that subscribed to “the everybody in one big room” philosophy. I guess it was supposed to allow everyone to work together as a team and become as “one with each other”. It may have achieved that purpose but it sure didn’t allow you to concentrate on your work very well. You could hear so many different conversations they had to pump in white noise. The constant ‘whhhsssssshhhssshhh” noise made me feel like I was a crewmember of the Starship Enterprise. I always wondered when Captain Picard would walk through.

Like all DBA units, our particular area was often populated with various developers and O/S technicians. Many different conversations were occurring, some that could be described as somewhat animated. The environment did not allow you to concentrate on the task at hand. We often had to go into small conference rooms to work on critical tasks.

The point I’m trying to make is that no matter what type of environment you work in; if you can concentrate OK, but if you are like me and you can’t, find a spot where you can. Block off some time, send questions to other DBAs and concentrate on the task at hand. Don’t attempt to answer questions and code a complex script at the same time. May seem obvious, but throughout my career I have personally watched numerous DBAs attempt to multitask when they are working on a critical process. It’s a recipe for a problem. Once you are done, follow rule number one and have someone review your work.

What Database Are You Working IN?
Working in the wrong database is a common problem for database experts as well as their less experienced counterparts. How many times have YOU found yourself running statements in the wrong environment? Feel free to include me in that not so select group. The operating system command SET can be used in Windows systems to display environment variables. The ENV command can be used to display the environment variables in UNIX. Many seasoned database administrators change their UNIX shell prompt in their profile to display the current Oracle SID. Displaying the current Oracle SID in the shell’s prompt provides a continuous reminder to the DBA of the database they are working in.  Google it – you’ll find dozens of scripts by your fellow DBAs.

Setting your SID Automatically During Logon
Here’s an easy recommendation – don’t do it. Many of the servers we work on contain several databases that use different database software releases. It is our standard to not set environments in logon scripts. We create shell scripts that have the database name in their title to set our SID, ORACLE_HOME, etc..

It is a little harder in Windows environments. During installation, Oracle often uses the first database created on the server as the default SID. You can override this configuration, but you’ll need to read the Oracle installation guides to find out how. The manuals will show you how to override the default database on Windows operating systems.

Saving Time VS Creating a Problem
I once watched a fellow DBA perform a rather complex set of administrative tasks to solve a problem. He was rapidly flipping back and forth between at least 15 active screens, copying and pasting and editing and copying and pasting and editing… I describe this particular activity as “Multiple Screen Syndrome”.  He also had several other screens open that were connected to other databases. He was multi-tasking to its highest degree.   Take a break, take a breath and look at what you are doing.

How about the rm -r /u0*/ora*/prod*/*/*.* command in UNIX? It’s the command that drops multiple databases in multiple directories. All in one painful swoop. How many times have you heard of a mistake caused by commands like this causing mass mayhem? When you make a mistake like this, you become immortalized in conversations for years to come. Get a few technicians together after work and ultimately the conversation will include “remember when Bob so-and-so ran that big rm -r command by mistake and wiped out the entire O/S on our production web server? You can’t tell me you haven’t heard stories like this.

My opinion as a database manager is that I would rather you take your time than showcase your multi-tasking and time saving skills. The more complex and critical the activity, the more basic you should become in your plan of attack.  Trust me when I say I won’t be impressed with your time savings “cut and paste” and wildcard expertise if I think it can even remotely be dangerous.

Safety First Mindset
I once saw a DBA log in to a database using a particular schema account. He then logged into a different database using a DBA account and dropped the schema with the same name he was logged into on the first database. I asked him why he logged into the first database using the schema account he just dropped in the second. He stated “Oracle won’t let you drop a schema that is connected. No matter what happens after this, I’m positive that I won’t drop the user in this database by mistake.” I like that Safety First mindset in a DBA.

You need to think Safety First when you are performing any particular complex or critical activity. Take the time and put one or two safeguards in place like the DBA did when he dropped the user.

Other DBAs may call you paranoid, I’ll call you an experienced DBA that would rather be safe than sorry.

Wrapup
The intent of this blog was to not provide you with a laundry list of recommendations. It was intended to help jumpstart your creative juices to think about different methods to protect yourself against problems. If you have any helpful hints, please feel free to respond to this blog with your Safety First Tips and Tricks.

Thanks for Reading,

Chris Foot
Oracle Ace
ace_2
Director Of Service Delivery

RDBAELOGO

Share and Enjoy:
  • Digg
  • StumbleUpon
  • del.icio.us
  • LinkedIn
  • RSS
  • PDF
  • Google Bookmarks

You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

21 Comments »

 
  • mdinh mdinh says:

    At a previous job, I once had to run through 36 hours of tapes to restore a multi-terabyte warehouse.

    Wow! I thought 16 hours for me was bad with the management query every hour.

  • Noons Noons says:

    Two lines I always add to .profile in my Unix environments (mostly, I use ksh):

    export ORACLE_HOSTNAME=`hostname`
    PS1=’${ORACLE_HOSTNAME}:${ORACLE_SID}$’

    Call me weird, but I’d much rather know at a glance from a command window exactly which node and database am I working with than for example the common “current directory” use for PS1. Particularly in these days of multiple instances per node.

    Another rule of thumb: never, ever, work on more screens simultaneously than can fit without overlapping in my monitor. It becomes too hard to keep track of what is going on in each and it’s only too easy to overlook something and type the wrong command in the wrong screen. Yes, I have a huge 30″ monitor! ;)

    One day someone will have to write the “Defensive DBA” book. Highly overdue.

    PS: no, ${PS1} does not need to be exported, for the last 15 years all Unix shells only need to export a variable once, after that any new value is automatically exported. Although folks continue to export everything in sight…

  • Ramesh Venkatakrishnan Ramesh Venkatakrishnan says:

    Good stuff. These tips may sound simple but very effective. “The seconds set of eyes” and “PS1″ are the masters in terms of effectiveness!

    The other noteworthy poka-yoke I can think of are:

    1) spooling the output

    Spool files not only serve as history/log. Its a great asset to review the script(s) execution. there may be script extending to multiple screens spool file helps us to review the status and we dont miss even a thing because of the scroll factor. Adding set echo on and set tim on doubles the benefit!

    2) using checklist

    Even if you have a done a particular task (say for example : database upgrades), following a checklist makes the job complete. Missing just one simple task might result in big impact at a later as those kind of stuff wont get noticed in the initial checkout. Using a Checklist helps to maintain the consistency also.

    3) nohup

    using nohup (or similar utility) where applicable is a good practice especially with the vpn timeout settings, this nohup comes in for rescue.

  • [...] Excellent artice by Chris Foot on being a good DBA. [...]

  • Marcus Mönnig Marcus Mönnig says:

    A trap I fell into:
    Using configured connections in SQL Developer and assuming that the descriptive connection name like “system@PROD@somehost” actually connects to PROD on somehost as user system… Ouch… Nothing fatal happened, but I looked at the wrong data for at least an hour while doing a performance analysis.

    Since then, I have a a folder structure in SQL Developer. Every connection inside a folder is double checked and can be trusted. Every connection directly below the root node is just a temporary connection and no assumptions can be made.

  • cfoot Chris Foot says:

    Good stuff is right Ramesh. If I think about your and Noon’s recommendations….

    Knowing what database you are in. Think about how many catastrophic errors could have been prevented if the DBA would have taken the time to display the SID using Noon’s script? I’m betting that virtually every DBA out there has at least thought they were in one database yet they were in another.

    How many times have we had a long running job terminated by mistake when we lost our connection to a server? If we would have run it with a NOHUP, we would have had no problems. Every DBA in this profession has had that happen to them at one time or another.

    Same with spooling the output. Very simple yet I think about how many times I forgot to spool the output. And how many times I could have really used that output that I forgot to spool off.

    Checklists are excellent examples of the ultimate Poka-Yoke. Simple and effective.

  • cfoot Chris Foot says:

    Marcus also has an excellent recommendation on SQL Developer. I’ll be adding that one to our list of Poka-Yoke ideas.

  • JJK JJK says:

    Thanks, Chris and everyone else, these are great tips, I thought I was the only DBA that use the trick to login to the Prod DB as the user you want to drop to safe guard dropping the wrong account. I have worked with many DBAs over the years and the good once use some or most of these tip. Another thing I have also notice is years of experience do not make good DBA, adopting these tips and integrating them in to your day-to-day work make a good DBA.

    I think you have to have an ego to want to be a DBA, but when you let your ego get in the way of you doing a good job then you have taken it too far. Being a good DBA or being good at anything is realizing what you do not know and not being to proud to ask for help.

  • Mike P Mike P says:

    I can sum that up with just 1 response…AMEN!

  • Thank You for Good Blog.

    If you never make mistakes, send me a resume. I’m always looking for a “Patron Saint of Oracle” here at Remote DBA Experts.
    Wow!
    I think everyone can do something mistake.
    But If about database. I think… If DBAs have a good skill, that they’ll check something what should to do before.

    rm -r /u0*/ora*/prod*/*/*.*

    that shouldn’t do… I think If DBAs have OS skills, they’ll know that make painful.

    Good Environment is helpful.
    However Skill and Experience can protect DBA do mistake.

  • Marcel Marcel says:

    If possible, try your `big` rm (or another )-command in a more forgiving environment first.

  • Norma Norma says:

    This article was absolutely RIGHT ON! I’ve forwarded the link to any number of other DBAs I know. I appreciate both your humor and insight. Thank You very much. As to different logins up at once, I not only labeled the screens’ pre-carat for the diff databases but used different colouring schemes just to MAKE SURE I was completely aware when I went to the RED background Yellow char of any production environment. This colour scheme is especially helpful with production calls at 2 or 3 am.

  • cfoot Chris Foot says:

    AHA! The colored screen idea is a great way to quickly differentiate between test and production environments. This recommendation is definitely one for us all to consider. How many times has an overworked, stressed out DBA found out that they were in the wrong database. Surely something that can be prevented but I think we can all agree how easy of a mistake it is to make!

    I’m sure there are some reading this and thinking “They shouldn’t be DBAs if they can’t tell what DB they are in.” My response to that would be BULL. A person gets sidetracked, stressed from overwork, tight deadlines, issues at home, not feeling well – whatever. IT HAPPENS. They mark of a good DBA is taking steps to prevent it – not being so egocentric to think it can never happen to them.

    Some of us possess an overabundance of talent for this profession, others are just plain lucky and the rest of us have to be thorough and prepared. I’m neither a DBA savant or that lucky so I have to be prepared.

    We at Remote DBA Experts will be considering the different colored screen mistake proofing idea at our next team meeting. All of these recommendations are great.

  • [...] attention to detail on which Chris Foot gives it a funny name: Poka-Yoke. You’ll have to read The Art of Being a Successful DBA – Poka-Yoke and Paranoid DBA Best Practices to know what it [...]

  • [...] attention to detail on which Chris Foot gives it a funny name: Poka-Yoke. You’ll have to read The Art of Being a Successful DBA – Poka-Yoke and Paranoid DBA Best Practices to know what it means.PlanetMySQL Voting: Vote UP / Vote [...]

  • Excellent post Chris, I think that over the time we all came up with tips like this and it is important to document them at the very same moment we came up with them (guilty of charge).
    I don’t only use different color screens, but also the position of them in the monitor, upper half and lower half are production and testing.

    Another one is to have always different passwords for production and testing environments, not only for sys/system but also for the OS’s oracle user. A co-worker of mine wanted to open a terminal to a testing environment, he clicked just below the right server and …. you know the rest of the story, don’t you? he was working so fast that he didn’t check SID, hostname or anything, in these cases a different password can help you stop a disaster.

  • Kevin Kevin says:

    “The environment that makes you do a “SELECT name FROM V$DATABASE” command in SQL*PLUS about 15 times in a row before you finally execute the “DROP USER CASCADE” statement.”

    This is sad but true! It’s indicative of the paranoia behind daily executing of task and statements that can bring the whole organization to a halt!

    It may sound corny but I still like the old fashioned method of keeping one session open on PROD when running database level commands and verifying the session. If I need to keep other multiple windows open I will go to a different screen on the solaris box where I have mulitple sessions open for status checking etc.

  • Narendra Narendra says:

    Good article. I regularly use the tips described in the article and comments.

    However, on AIX servers, I even change putty window title to show what Oracle sid has been set.
    For example:
    I will add following in .profile file:

    change_title=TITLE=’host_name:$ORACLE_SID;PROMPT_COMMAND=”33]0;${TITLE}07″; echo $PROMPT_COMMAND’
    dbsid=’export ORAENV_ASK=NO; export ORACLE_SID=dbsid; . oraenv;export ORAENV_ASK=YES;change_title’

    So whenever I type dbsid at command prompt, the window title changes to hostname:dbsid.

  • virtualmono virtualmono says:

    So true… whenever I have to perform some potentially destructive action, I tend to double-check several times before hitting the enter-key. I liked the part about concentration too – unfortunately it’s hard to avoid distractions by ongoing conversations around in a cubicle-environment (will management ever learn this?).

  • Ray Ray says:

    I live by a few simple rules:

    1. Always have a way un-Screw-up what you are about to screw up. (Rollback scripts, fresh backup…)

    2. “Never Delete Anything”. I get a lot of flack for this one. Of course, this needs to be done ‘within reason’ but I would much rather catch grief for over-running a backup disk than missing data.

    3. Disk Space is cheap… unemployment is expensive.

  • Craig Craig says:

    I have wrote up how I set my sessions between live and test here http://tinyurl.com/gcblog65 feel free to have a look, I set-up my putty sessions very visual to determine between the two

 

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Spam Protection by WP-SpamFree