Better Safe Than Sorry
As IT organisations face greater pressure to manage, store, and retrieve increasing amounts of corporate information, they are searching for new ways to keep data readily available, whenever and wherever it is needed. To support the requirement for Business Continuity and Availability (sometimes referred to as ‘BC&A’), a wide variety of solutions is available in the industry for different kinds of organisations. To name a few, cluster solutions, replication solutions and, to some extent, backup and recovery solutions qualify as BC&A solutions, depending on the requirements of the business.
One solution that stands out for meeting these business requirements, providing the most secure and robust mechanism for data replication and application availability, is storage-based replication. Based on the industry-standard Fibre Channel (FC) protocol, it is the default choice for businesses today.
In a typical scenario, the infrastructure would comprise servers, SANs, storage subsystems and IP networks for connectivity to the information islands. To achieve a true Disaster Recovery (DR) capability, the same infrastructure, or a scaled-down version of it, has to be maintained at the DR site as well. So there would be a set of servers in production using the production copy of data at the primary site, and another set of servers at the DR/BC site waiting for the primary servers or site to fail. The servers at the DR site can also be used in an active-active configuration, performing other computing tasks for the organisation.
To achieve Business Continuity and Disaster Recovery (BC/DR), each of these components has to be taken into consideration and robust methodologies applied to each of them. Storage Area Networks and storage subsystems, at the core of this infrastructure, require special attention. For connectivity requirements between the two sites, refer to Figure 1, which lists the technologies available and where they fit best. Other than the link between the sites, such technologies essentially require SAN infrastructure, storage subsystems with licensed replication software (such as PPRC, SRDF, TrueCopy or Continuous Access) and routers that handle long-distance propagation and, if needed, protocol conversion.
Long-distance SAN extension builds on the storage-based replication features of the storage subsystems and requires interconnects that can propagate the FC protocol over long distances. This propagation can be done in native FC or by converting FC into IP and sending the traffic over virtually unlimited distances. Although native long-distance FC connectivity is generally the most reliable, organisations sometimes prefer to send FC over long-distance IP links to take advantage of their existing IP infrastructure.
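As a rough, back-of-the-envelope illustration of why distance drives this choice, the short sketch below (plain Python, not vendor code) estimates the round-trip delay that the fibre itself adds per kilometre of separation; real links add switch, router and protocol-conversion overheads on top of this figure.

    # Illustrative sketch only: light travels through optical fibre at roughly
    # 200,000 km/s, i.e. about 5 microseconds per kilometre one way, so every
    # kilometre between sites adds about 0.01 ms of round-trip delay before any
    # equipment or protocol overhead is counted.

    def propagation_rtt_ms(distance_km: float) -> float:
        """Round-trip propagation delay over fibre, in milliseconds."""
        one_way_us_per_km = 5.0  # ~5 us/km in fibre (refractive index ~1.5)
        return 2 * distance_km * one_way_us_per_km / 1000.0

    for km in (2, 50, 500, 2000):
        print(f"{km:>5} km separation -> ~{propagation_rtt_ms(km):.2f} ms round trip")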
It is essential to consider future needs when choosing the appropriate technology.
Today, the most common, and also the most logical, approach to building a DR practice is to build a BC site not very far from the production site, and then to build a seismically distant DR site from there on. Refer to Figure 2.
The major advantage of this approach is that companies don’t have to invest in redundant infrastructure and high-cost links from day one. As and when the business grows and demands another level of protection, DR sites can be built. However, some companies, especially in the telecom and finance industries, have invested in a complete DR site from day one. It is the business, or rather the business impact in case of a disaster, that defines the level of redundancy and justifies the case for two-stage DR (multi-hop DR sites) or a single-stage mirror site.
The latencies of the networks involved also play a vital role in achieving DR and BC. If latencies are high and bandwidth is limited, asynchronous (out-of-sync) replication is generally proposed and implemented. This is also a function of the vendors’ qualification of different links and the latencies they deem acceptable for their replication technology. While the storage-based replication software manages data replication at block level from the storage subsystems, the application servers continue to access data non-disruptively and work on the local copy of data, i.e. the data residing in the storage at the local site. While deciding on storage-based replication utilities and storage subsystems, the organisation should ensure that these utilities are certified with the applications or databases it wishes to replicate to the other site.
This certification merely ensures that the bits and bytes replicated will be consistent when they arrive at the DR-site storage. These utilities are storage-controller based and take the replication I/O load off the servers.
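The choice between synchronous and asynchronous replication mentioned above essentially comes down to two numbers: the round-trip latency every write would have to wait out, and whether the link can absorb the rate at which data changes. The following is a minimal sketch of that reasoning; the latency budget and bandwidth headroom figures are illustrative assumptions, not values from any vendor’s qualification matrix.

    # Simplified decision sketch, not a vendor qualification tool; the latency
    # budget and the 70% bandwidth headroom figure below are assumed values
    # chosen only to illustrate the trade-off.

    def recommend_replication_mode(rtt_ms: float,
                                   write_latency_budget_ms: float,
                                   change_rate_mbps: float,
                                   link_bandwidth_mbps: float) -> str:
        """Suggest synchronous or asynchronous replication for a given link."""
        # Synchronous replication holds every write until the remote storage
        # acknowledges it, so the link round trip is added to each write.
        if rtt_ms > write_latency_budget_ms:
            return "asynchronous: latency too high for in-sync replication"
        # Even with acceptable latency, the link must absorb the change rate
        # with some headroom, or replication falls behind and writes back up.
        if change_rate_mbps > 0.7 * link_bandwidth_mbps:
            return "asynchronous: insufficient bandwidth headroom"
        return "synchronous: latency and bandwidth both within limits"

    # A nearby BC site versus a seismically distant DR site:
    print(recommend_replication_mode(0.5, 5.0, 40, 200))
    print(recommend_replication_mode(30.0, 5.0, 40, 200))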
Now, imagine a scenario in which disaster strikes. The applications and servers lose connectivity with the production copy of data (there is a high chance that the servers are down too). At this moment, the replication utilities start a consistency check on the replicated data (some of them already do this while replicating) and ensure that every byte transferred is usable by the application servers at the DR site.
There is intelligence built into the storage subsystems that allows them to change the state of the data volumes at the DR site from read-only to read-write. This is necessary, as this copy of the data will be the production copy for the applications until the primary site comes back up.
Geographically dispersed cluster solutions play a vital part in such implementations, as they sense the failure at the primary site and then take corrective action by mounting the volumes onto the servers at the DR site and starting up the applications.
The storage subsystem and replication utility ensure consistency; the cluster software, along with the volume managers, ensures that this DR copy of data is presented to the servers and ready to be accessed by the applications.
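To make that hand-off concrete, here is a highly simplified sketch of the failover sequence such cluster software automates. Every function in it is a placeholder stub rather than an API from any real replication or cluster product; only the order of operations (consistency check, volume promotion, mounting, application start-up) reflects the process described above.

    # Hypothetical failover sequence; all functions are illustrative stubs.

    def check_replica_consistency(volume: str) -> bool:
        print(f"checking consistency of replica {volume}")          # stub
        return True

    def promote_to_read_write(volume: str) -> None:
        print(f"promoting {volume} from read-only to read-write")   # stub

    def mount_volume(volume: str) -> None:
        print(f"mounting {volume} on a DR-site server")             # stub

    def start_application(app: str) -> None:
        print(f"starting {app} against the DR copy of data")        # stub

    def fail_over_to_dr(volumes, applications):
        """Order of operations a geographically dispersed cluster automates."""
        # 1. Confirm the replicated copy is consistent and usable.
        for vol in volumes:
            if not check_replica_consistency(vol):
                raise RuntimeError(f"replica {vol} inconsistent; aborting failover")
        # 2. Promote the DR volumes so the DR copy becomes the production copy.
        for vol in volumes:
            promote_to_read_write(vol)
        # 3. Present the volumes to the DR servers and bring up the applications.
        for vol in volumes:
            mount_volume(vol)
        for app in applications:
            start_application(app)

    fail_over_to_dr(["datavol01", "logvol01"], ["database", "application server"])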
A financial organisation in Mumbai replicates its data to an alternate site about two kilometres away, and has a full facility where key application users can come in, sit at networked PCs and start working. As soon as they log on to the network, their profiles, applications and permissions are restored to that PC and they continue to work seamlessly. Regular testing has made it possible for this system to work at all times, and even more so at the time of a disaster.
To summarise, many customers in India have embarked on DR implementations, and a few others have successfully completed their projects as well. Implementing DR requires expertise from various areas of IT infrastructure, and each of these functional experts plays a vital role in the success of the project.
The key to running a resilient IT setup is to implement DR practices, train the team members and make them aware of their roles and responsibilities. To implement and depend on DR/BC plans, a perpetual test process is absolutely necessary.