An AIX recovery can become necessary for a number of reasons: the loss
of some system files, an unexplained system crash, a site environmental
problem, or simply a request for a system recovery test. Whatever the
cause, be prepared to hit the ground running and get the recovery done,
or be ready to pack your bags and say goodbye.
AIX recovery is a basic skill; there are no excuses for not having it or
not being prepared to use it as part of a disaster recovery (DR) plan.
AIX system recovery isn’t rocket science, but you need to have your wits
about you. This article will help you prepare to perform a recovery
quickly and with confidence.
Prepare, Prepare, Prepare
Key requirements for a successful recovery are an up-to-date
configuration listing of the target machine, a current system backup,
and application backups or re-installation media. Whether you’re dealing
with a full or partial restore, or a simulated or real disaster, the
processes involved are the same. If you’re prepared with these
prerequisites, your recovery will go smoothly; if not, you’re in for a
difficult time.
The best way to ensure that you’re prepared is to create a bootable
system backup of your AIX servers routinely (at least weekly) so that it
captures routine changes such as PTFs and minor file edits. Also, track
the status of applications and data
being backed up daily, because these components are much more volatile
than the OS itself. Typically, application backup is the responsibility
of an operational team, but as an AIX systems admin, you should be
informed that the data is being backed up successfully; after all, the
applications do reside on your machine. You should also take a
configuration report for each server. At minimum, this should include
the output of the following commands:
- lspv
- lsvg -l <vgname>, lsvg -p <vgname>, lsvg <vgname> (for all volume groups)
- lsslot -c slot
- lscfg -vp
- lsdev
- lsattr -El sys0
A script can collect this information for you automatically and archive
it off the machine, for example by emailing its output file to you. With
the information these commands provide, you’ll be on a good footing for
a confident recovery.
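As a starting point, the collection step might look something like the
sketch below. It only wraps the commands listed above; the function
names, the section headers, and the output path and mail address in the
usage comment are my own placeholders. A command that isn’t installed is
noted in the report rather than treated as an error.

```shell
#!/bin/sh
# Sketch only: function names and report layout are illustrative assumptions.

# Print a header, then run the command; note it if the command is missing.
run_section() {
    echo "===== $* ====="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(command not found: $1)"
    fi
    echo
}

# Write the whole configuration report to standard output.
collect_config() {
    run_section lspv
    # lsvg with no arguments lists the name of every volume group
    if command -v lsvg >/dev/null 2>&1; then
        for vg in $(lsvg); do
            run_section lsvg "$vg"
            run_section lsvg -l "$vg"
            run_section lsvg -p "$vg"
        done
    fi
    run_section lsslot -c slot
    run_section lscfg -vp
    run_section lsdev
    run_section lsattr -El sys0
}

# Example use (path and address are placeholders):
#   out=/tmp/$(hostname)_config_$(date +%Y%m%d).txt
#   collect_config > "$out"
#   mail -s "config report $(hostname)" admin@example.com < "$out"
```

Run from cron after the weekly system backup, this keeps the report and
the backup roughly in step with each other.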
Expect the Unexpected
Recovering a system to a new server at a remote site typically involves
restoring the OS from a tape or DVD bootable backup. You can perform a
boot restore via the network if you’ve taken remote network system saves
with netboots (e.g., Storix or NIM), but this process is much slower
than restoring from a tape or DVD, and only the largest “hot site”
facilities have netboot host capabilities. The rest of us must make do
with bare-bones recovery from the trenches.
The restore-from-bootable-media process is straightforward. First,
because it’s best to start up without a network attachment, make sure
all Ethernet and other network cables (other than storage) are
unplugged. Next, insert the bootable media—tape or DVD—into a
boot-capable drive and start the system. It’s best if the server you’re
restoring to closely matches the specs of the failed server, but some
differences can be accommodated. For example, the root volume group
(rootvg) disk(s) might not be the same size, but as long as they’re
larger, not smaller, the restore will complete. You should be prepared
to alter some of the logical volume copies or re-size the logical
volumes during the AIX recovery process if your restore product allows.
Confirm Settings in New Environment
Confirm the following with the network team or DR manager, and have the
details on hand before you start:
- Host and gateway IP addresses (IPv4 and IPv6)
- Subnet mask
- DNS servers
- DNS entries (forward and reverse for all addresses owned by the host)
- Firewall, ipfilter, and/or TCP Wrappers rules
- Printed copies of all customized directories showing ownership and permission settings
- Mail relay host (if your machine forwards mail)
- NTP server for xntpd
You might be on a different LAN or VLAN for the duration of the
disaster, so be sure to document the IP environment for the recovery
site so that you’re not fighting network issues during recovery
operations. And, of course, if your system interacts with other servers
or services, ensure that those are accessible from the recovery site.
Review Operational Parameters
Remember that all Ethernet cables should be disconnected at startup. If
the machine comes up with the network interface disabled, that’s good;
if it comes up enabled, you forgot to take out the Ethernet cables,
which can complicate startup troubleshooting. (You don’t want some
automated application process kicking off uncontrolled sessions.) When
the AIX recovery boot-up completes, it’s time to check all the
operational parameters, and then check them again. Review the
/etc/inittab file, comment out any services you don’t want started, then
have init re-read the file with telinit q. Check root’s crontab (and
those of any application users) and disable any periodic jobs that
shouldn’t run yet. Once
you’re satisfied that all application processes and undesired mail
sending processes are commented out, stop or kill any processes that
might have been kicked off before you reviewed /etc/inittab and
crontabs. You might want to delete any outbound queued email files held
in /var/spool/mqueue because the mail system might try to send those
messages, which you might not want until you’re ready for full
production operation.
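To see at a glance what init will still start, something like the
following sketch can help. It assumes the usual
id:runlevel:action:command layout of /etc/inittab, with a leading colon
marking a commented-out entry; the function name is purely illustrative.

```shell
# Sketch only: list active (uncommented) inittab entries as
# "id<TAB>action<TAB>command". Defaults to /etc/inittab.
active_inittab() {
    awk -F: '
        /^:/ || /^[ \t]*$/ { next }      # skip commented-out and blank lines
        {
            cmd = $4
            # the command field may itself contain colons, so rejoin it
            for (i = 5; i <= NF; i++) cmd = cmd ":" $i
            printf "%s\t%s\t%s\n", $1, $3, cmd
        }' "${1:-/etc/inittab}"
}

# Example use:
#   active_inittab                 # what init will start right now
#   crontab -l                     # then review root's periodic jobs
```

Anything in the output that belongs to an application is a candidate for
commenting out before the next telinit q.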
Next, stop and re-start sendmail so you have a clean mail agent running.
Review any firewall, ipfilter, and TCP Wrappers rules you have; these
will almost certainly have to be amended now that you’re in recovery
mode and on a different network. If your machine’s database
applications use raw devices, be sure to check the ownership of these
devices in /dev, because a system restore is likely to have reset it.
Most databases use async I/O, so check that the AIO servers are running
with pstat -a (look for aioserver entries). On AIX 6.1 or later, the AIO
servers are started automatically when needed. On AIX 5.3, you’ll
probably need to start them yourself (typically with mkdev -l aio0).
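The raw-device ownership check is easy to script. The following is a
minimal sketch: the function names are made up, the device name and
owner in the usage comment are hypothetical, and the expected values
should come from your own configuration report.

```shell
# Sketch only: print "owner:group" for a path, taken from ls -ld output.
owner_of() {
    ls -ld "$1" | awk '{ print $3 ":" $4 }'
}

# Compare a path against an expected "owner:group" and report the result.
# Returns non-zero on a mismatch so it can be used in a larger script.
check_owner() {
    actual=$(owner_of "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK       $1 $actual"
    else
        echo "MISMATCH $1 is $actual, expected $2"
        return 1
    fi
}

# Example use (hypothetical device and owner):
#   check_owner /dev/rdata01lv oracle:dba
```

A mismatch is then fixed with chown on the device before the database is
started.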
On the Network
Bring the machine onto the network by connecting the Ethernet cables
(you should have already configured the net interfaces). Verify that you
can ping the network gateway (both IPv4 and IPv6 if you use it), your
DNS server, and any necessary collaborative servers. Validate that your
configured DNS correctly resolves local and global names, and give
special attention to reverse name resolution for the IP addresses owned
by the AIX system you’re recovering. One of the most common root causes
of startup failure is missing DNS entries for the new network
environment.
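These checks are repetitive enough to be worth wrapping in a tiny
pass/fail helper, sketched below. The helper itself is my own
illustration, and every address and host name in the usage comment is a
placeholder; substitute the values confirmed for the recovery site.

```shell
# Sketch only: a minimal pass/fail wrapper for connectivity checks.
FAILED=0

# Run one check quietly; print PASS or FAIL with its label and count failures.
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS  $label"
    else
        echo "FAIL  $label"
        FAILED=$((FAILED + 1))
    fi
}

# Example use (all addresses and names are placeholders):
#   check "gateway ping"     ping -c 3 192.0.2.1
#   check "DNS server ping"  ping -c 3 192.0.2.53
#   check "forward lookup"   host aixhost.example.com
#   check "reverse lookup"   host 192.0.2.10
#   [ "$FAILED" -eq 0 ] || echo "$FAILED network check(s) failed"
```

Keeping the list of checks in a file alongside the DR plan means the
same run can be repeated after every network change.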
If static routes are required to reach any internal or WAN networks
other than through the default gateway, use the netstat -rn command to
verify that the routes exist, and add them if needed. Restart the sshd
service if it’s present (do this from the console, or you’ll cut off
your own command-line session). Test a remote connection, such as Telnet
or ssh, to confirm that you have remote access. Next, start the xntpd
service so the machine’s clock begins to sync, and verify the time with
the date command. You should now be able to send a test email to make
sure sendmail forwarding works:
# echo "test mail" | mail admin@unixmantra.com
Now you're ready to configure your data volumes.
Bring In the Disks
Internal data volumes won’t typically be saved with the system bootable
backup. You must restore them separately, so be sure your DR plan
includes the instructions for this step. If you use a Storage Area
Network (SAN), the SAN volumes might reside at a remote site. If so, be
sure to get the iSCSI configuration or FC zoning correct (there’s no
time to mess around), then run cfgmgr to bring the volumes in. The same
goes for locally attached disks. Be sure to create your disk RAID
configuration, if required. If you’re only going to be at the DR site
for a few days, you can generally forgo RAID altogether: over so short a
period, the small risk of a disk failure rarely justifies the added
complexity.
Create the volume groups and file systems based on the configuration
reports you captured previously. When you gather those reports, it’s
worth also generating a script that recreates the file systems
automatically; as I’ve learned from experience, this saves a lot of time
during a recovery.
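One way to approach this is to keep a small list of file systems and
turn it into crfs commands, as in the sketch below. It assumes the
volume groups already exist (created with mkvg, for example) and that
JFS2 is wanted; the three-column list format and the function name are
my own invention. The function only prints commands, so you can review
them before running anything.

```shell
# Sketch only: read "vgname mountpoint size" lines on stdin and print one
# crfs command per line. Nothing is executed here.
gen_crfs() {
    while read -r vg mnt size; do
        case $vg in
            ''|'#'*) continue ;;     # skip blank and comment lines
        esac
        echo "crfs -v jfs2 -g $vg -m $mnt -a size=$size -A yes"
    done
}

# Example use (fs.list is a hypothetical file kept with the config report):
#   gen_crfs < fs.list            # review the generated commands
#   gen_crfs < fs.list | sh       # then create the file systems
```

The -A yes flag marks each file system to be mounted at boot; crfs
itself does not mount it, so a mount of each new mount point is still
needed afterward.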
Restore the Application Data
As I noted earlier, your application data must be backed up separately
from the bootable OS media, and thus must be restored separately. If
you’re using a third-party product for your application backups, check
that the client is running and talking back to the remote backup server.
Next, restore the applications and the data (if you do incremental
backups, ensure the operational team has the full list of tapes
required). This is typically the operational team’s responsibility, so
be sure to hurry them along. When all the data is recovered, review the
permissions of the base directories or file systems, then review them
again. Once you’re satisfied, prepare to start up the services in a
controlled manner, one by one. If you have databases to restore, make
sure you have the latest dumps before restoring them. Review the
processes running and consult with the applications’ support teams so
that there are no issues. If everything looks good, stop all
applications.
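For the permissions review, a before-and-after comparison is less
error-prone than checking by eye. The sketch below records mode, owner,
and group for a list of directories so the two listings can be compared
with diff; the function name and the paths in the usage comment are
placeholders.

```shell
# Sketch only: print "mode owner group path" for each directory given.
perm_snapshot() {
    for d in "$@"; do
        ls -ld "$d" | awk -v p="$d" '{ print $1, $3, $4, p }'
    done
}

# Example use (paths are placeholders):
#   perm_snapshot /data /opt/app > perms.before   # on the original system
#   perm_snapshot /data /opt/app > perms.after    # after the restore
#   diff perms.before perms.after && echo "permissions match"
```

The "before" listing belongs with the configuration report, so that it
is available off the machine when you need it.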
A Reboot with Pause
Now’s the time to test that the machine can reboot. You might be
thinking, “Why do this; let’s just get the machine recovered?” Well, if
the machine goes down at a working DR site, it doesn’t reflect well on
you or your team, so run this test now before you release the machine to
the users. There are many factors that could stop an automatic boot,
and because your initial boot was closely attended, you might not have
encountered or noticed them. Simple things such as an incorrectly seated
Ethernet cable or an IP address conflict can cause a reboot to stop and
wait for manual intervention, so a trial reboot is essential.
First, clear the error log with errclear 0 so that you start with a
clean slate. Then issue the bosboot command, followed by the reboot
command. Running bosboot before any reboot or shutdown is a good habit
because it rebuilds the boot image while you can still fix any problem
it reports. If for some reason the boot hangs, count your lucky stars
that you discovered the problem now.
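The sequence can be written down as a small sketch like the one below.
It prints the steps by default and runs them only when called with the
argument run. Note that I’ve used the common bosboot -ad /dev/ipldevice
form and shutdown -Fr for the reboot step; both are my choices rather
than anything prescribed above, so substitute whatever your site uses.

```shell
# Sketch only: print the trial-reboot steps; execute them only when the
# first argument is "run". Stops at the first step that fails.
trial_reboot() {
    for step in "errclear 0" "bosboot -ad /dev/ipldevice" "shutdown -Fr"; do
        if [ "${1:-}" = "run" ]; then
            $step || return 1
        else
            echo "would run: $step"
        fi
    done
}

# Example use:
#   trial_reboot        # dry run, prints the three steps
#   trial_reboot run    # as root, from the console
```

Stopping on the first failure matters here: if bosboot reports an error,
you want to know before the machine goes down.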
A Final Cross-Check, Please
Once the machine comes back up, check that all services are up. Get the
support team to connect to the applications. Then relax and wait for the
phone calls asking for the odd bit of extra tinkering. These are
inevitable, I’m afraid; however, the bulk of your work is now
done.
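For services that run under the System Resource Controller, the "all
services are up" check can be scripted against lssrc -a output. The
sketch below assumes only that each line starts with the subsystem name
and ends with its status; the function name and the list of subsystems
in the usage comment are examples, not a definitive set.

```shell
# Sketch only: succeed if subsystem $1 is shown as "active" in the
# lssrc -a output read from standard input.
subsys_active() {
    awk -v name="$1" '
        $1 == name && $NF == "active" { ok = 1 }
        END { exit !ok }'
}

# Example use (adjust the list to the services your host needs):
#   for s in sshd xntpd sendmail; do
#       lssrc -a | subsys_active "$s" || echo "NOT ACTIVE: $s"
#   done
```

Applications started from /etc/inittab or their own scripts fall outside
the SRC and still need to be confirmed with the support team.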