
Multi-Touch Developer Journal




Failover and Recovery of Enterprise Applications - Part 1

High availability - moving beyond clustering

Number of Servers in the Cluster
Many deployments consist of a two-node cluster. While this is sufficient for low-volume, non-mission-critical sites, a fault-tolerant cluster should contain a minimum of three servers. The reason is that during HTTP session replication only the deltas are copied from the primary to the secondary, and only when HTTP requests are made. Consider a two-node scenario:

  • Node A goes down, leaving Node B unpaired, so no session replication occurs for any activity during this interval
  • Node A is brought back up, but if there is no HTTP activity, or the user does not modify any session values, no session replication occurs
  • Node B now goes down, and any values added to the user session while Node A was down can be lost
It is therefore good practice to have a minimum of three servers in a cluster to keep user sessions highly available. Apart from avoiding session loss, this architecture also provides higher redundancy.
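The loss window in the two-node scenario can be modeled in a few lines. This is a toy model of request-time delta replication, not WebLogic's actual replication protocol; all class and field names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class TwoNodeLossDemo {
    // Toy model of a two-node cluster: the session lives in the primary map,
    // and a delta is copied to the secondary only when a request writes to
    // the session *and* the peer is reachable at that moment.
    Map<String, String> primary = new HashMap<>();
    Map<String, String> secondary = new HashMap<>();
    boolean secondaryUp = true;

    void setAttribute(String key, String value) {
        primary.put(key, value);
        if (secondaryUp) {
            secondary.put(key, value); // replication happens on the request
        }
    }

    // Failover: the secondary's (possibly stale) copy becomes the session.
    Map<String, String> failover() {
        return secondary;
    }

    public static void main(String[] args) {
        TwoNodeLossDemo cluster = new TwoNodeLossDemo();
        cluster.setAttribute("user", "alice");   // replicated to the peer
        cluster.secondaryUp = false;             // peer node goes down
        cluster.setAttribute("cart", "3 items"); // no peer: not replicated
        cluster.secondaryUp = true;              // peer restarts, but with no
                                                 // new setAttribute(), no resync
        // A failover now serves the stale copy: "cart" is lost.
        System.out.println(cluster.failover()); // {user=alice}
    }
}
```

With a third node available, the surviving server can pair with it immediately, which closes this window.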

Redundant Administrative Server Configuration
It is evident from the deployment architecture diagram that the single point of failure in the platform is the administration server in a domain. Although failure of the administration server has no impact on the availability of applications deployed on the managed server instances, no further administrative functions can be performed, and no monitoring can be done through the WebLogic console while it is down. To recover quickly from an administration server failure, administrators should maintain a backup of the configuration file (config.xml) and other associated files so the admin server can be restarted promptly on another host.
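Such a backup can be as simple as periodically copying config.xml to a safe location, keeping timestamped copies for rollback. A minimal sketch in plain Java; the paths and naming scheme here are illustrative, not WebLogic conventions:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ConfigBackup {
    // Copies a configuration file into backupDir, stamping the copy with the
    // current time so that older backups are retained for rollback.
    public static Path backup(Path configFile, Path backupDir) throws Exception {
        Files.createDirectories(backupDir);
        Path target = backupDir.resolve(
                configFile.getFileName() + "." + System.currentTimeMillis());
        return Files.copy(configFile, target, StandardCopyOption.COPY_ATTRIBUTES);
    }

    public static void main(String[] args) throws Exception {
        // Example: back up a domain's config.xml (path is illustrative).
        Path config = Paths.get(args.length > 0 ? args[0] : "config.xml");
        if (Files.exists(config)) {
            System.out.println("Saved " + backup(config, Paths.get("backups")));
        }
    }
}
```

In practice this would be scheduled (e.g., from cron) and the backup directory would live on a different host or shared storage, so the files survive the loss of the admin machine.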

Managed Server Independence
In the event the administration server cannot be brought up due to unknown technical difficulties, it is a good idea to configure the managed servers to start in Managed Server Independence (MSI) mode. When a managed server boots, it attempts to connect to the admin server; if the admin server is down or unreachable, it looks for a file named msi-config.xml and boots itself using the configuration in that file.
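The boot-time decision can be pictured as follows. This is a hypothetical sketch of the fallback logic only, not WebLogic's implementation; in a real server the host, port, and file location come from the startup configuration:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MsiBootDecision {
    // Prefer the admin server; if it is unreachable, fall back to the local
    // msi-config.xml copy. Returns a description of the chosen configuration
    // source, or null if neither is available (the server would fail to boot).
    public static String chooseConfigSource(String adminHost, int adminPort,
                                            Path msiConfig) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(adminHost, adminPort), 2000);
            return "admin:" + adminHost + ":" + adminPort;
        } catch (IOException adminUnreachable) {
            return Files.exists(msiConfig) ? "file:" + msiConfig : null;
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseConfigSource("localhost", 7001,
                Paths.get("msi-config.xml")));
    }
}
```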

Startup Modes
To further reduce the time required to bring up managed servers, additional servers can be started in STANDBY mode. Although the typical start-up time for a managed server is less than a minute, it can take several minutes to reach the RUNNING state, depending on the number of application deployments, the number of configured EJBs, and the number of start-up classes loaded. In the STANDBY state a server has initialized all of its services and applications; it can accept administration commands and participate in cluster communication, but it does not accept requests from external clients. A managed server configured to start in STANDBY mode can therefore be made available with significantly less downtime.

Process Monitoring
In most deployments, administrators invariably have scripts that continuously monitor the Java processes. In some cases these scripts merely check that the process is running; in others they invoke the WebLogic PING utility to ensure that the server is actually responding. A more advanced level of monitoring may also verify the availability of the application itself.
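The lowest level of check — is anything listening on the server's port — can be written in a few lines. A sketch only; the PING-level and application-level checks would layer an actual request on top of this:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    // Returns true if a TCP listener accepts a connection on host:port within
    // timeoutMs. This is a process-level check only: a server may accept the
    // connection yet be unable to serve requests, which is why the PING and
    // application-level checks are layered on top of it.
    public static boolean isListening(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Example: poll a (hypothetical) managed server's listen port.
        System.out.println(isListening("localhost", 7001, 1000)
                ? "server port is up" : "server port is down");
    }
}
```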

Node Managers
The Node Manager is a Java utility that runs as a separate process from WebLogic Server and allows administrators to perform common operations tasks on a managed server, regardless of its location with respect to its administration server. While use of the Node Manager is optional, it provides valuable benefits if your WebLogic Server environment hosts applications with high-availability requirements. Essentially, the Node Manager monitors the health of WebLogic Server instances and automatically restarts them if they go down.

External Cluster Options
In addition to the services provided by the server and homegrown scripts, hardware clustering services can be configured to monitor server health and restart failed instances. One big advantage of hardware clustering services is the ability to start servers on different machines using virtual IP addresses. For example, Veritas Cluster Server can be configured to start a WebLogic Server instance on a separate machine after a configured number of attempts to restart the server on the failed node have been exhausted.

Using a Clustered File System
Use a common file system for all of the servers in the cluster. This ensures that each server has access to the transaction log files in the event another server needs to be started on a different machine. While a common clustered file system is valuable in making the transaction logs available to all of the servers, it is strongly recommended that this file system be mirrored.

Avoid Pinned Services
While clustering has been in wide use for a few years now, quite a few applications are still running with singleton/pinned services. In some cases this is driven by a business requirement; in others it is the result of a not-so-well-thought-out design. Architects should review the application design to avoid services that are not cluster-aware.

Session Replication
The value of clustering Web applications lies in the ability to replicate user sessions to a secondary server in the cluster. Unfortunately, many application developers still do not take this into consideration: they store objects in sessions that cannot be replicated, or construct session objects that cause significant performance overhead during replication. Application developers should review the documentation on best practices for creating replicable user sessions.
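The two core requirements are that session attributes be serializable and that the container be told when they change. The classic pitfall can be shown with plain Java serialization as a stand-in for the copy that travels to the secondary server (a sketch; WebLogic's actual wire format and delta tracking differ, and the class names here are invented):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SessionAttributeDemo {
    // A typical session attribute: it must implement Serializable to be
    // replicable at all.
    static class Cart implements Serializable {
        int itemCount;
    }

    // Serialization round trip: a stand-in for one attribute being copied
    // to the secondary server at setAttribute() time.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T replicate(T attribute) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(attribute);
        out.flush();
        return (T) new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        Cart cart = new Cart();
        cart.itemCount = 2;

        // The secondary holds a copy made at the last setAttribute() call.
        Cart onSecondary = replicate(cart);

        // Pitfall: mutating the object without calling setAttribute() again
        // leaves the secondary's copy stale -- a failover loses the change.
        cart.itemCount = 5;
        System.out.println("primary=" + cart.itemCount
                + " secondary=" + onSecondary.itemCount); // primary=5 secondary=2
    }
}
```

This is why the rule of thumb is to call setAttribute() again after any mutation of a session-scoped object, and to keep session objects small so each copy is cheap.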

More Stories By Sudhir Upadhyay

Sudhir Upadhyay is currently an application architect with Architecture and Shared Services at JPMorgan Chase. Prior to joining JPMorgan, he was a principal consultant with BEA Professional Services, where he helped customers design and implement enterprise J2EE solutions. He is a BEA Certified WebLogic Developer and a Sun Certified Java Developer.

Comments (2)


Most Recent Comments
Thirunavukarasu 02/04/06 06:31:00 AM EST

A neatly explained technical article. Very useful, very crisp. Thanks.

Viktor Lioutyi 08/12/05 10:34:27 AM EDT

Hi,

How do you test failover in an automated manner?

"In several scenarios, administrators end up performing basic failover testing by shutting down the processes and verifying that the subsequent requests succeeded.

Although this level of testing can satisfy the failover requirements for the records, more robust failover testing needs to be performed to ensure a proper recovery if failures do occur."

We did manual testing and failover worked. But we would like to test failover automatically to make sure that it works for all of our 1000+ pages. BEA does not have any tool for such testing.

There are different reasons why someone may want to test all pages for failover.

1) WebLogic only replicates attributes that were modified. A call to the session's setAttribute() method is the indication to WebLogic that an attribute was modified. This call may be made explicitly, or implicitly when JSP tags are used. It is possible that on some pages members of complex attributes were modified without WebLogic being notified, so it will not replicate those attributes.

2) Complex attributes may reference other objects and attributes. After replication these references may be broken. For example, attributes A and B both reference object C. If only attribute A was modified, only A will be replicated. After replication, A and B may point to different copies of C, and the program may no longer work correctly.

3) Some objects are assumed to be singletons. The developer needs to provide a special serialization implementation to support replication of singleton objects. If this implementation is omitted, replication may create copies of a singleton object.

4) Transient fields are not replicated, so there should be recovery code that restores the values of these fields after replication. Without testing, we do not know whether all of our recovery code works correctly.

There are probably other reasons too.
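For instance, point 2 is easy to reproduce with plain Java serialization as a simplified stand-in for per-attribute replication (class names are made up for the example):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SharedReferenceDemo {
    static class C implements Serializable {
        int value = 42;
    }

    // Attributes A and B each hold a reference to the same C.
    static class Attribute implements Serializable {
        C ref;
        Attribute(C ref) { this.ref = ref; }
    }

    // Each attribute is serialized in its own stream, as per-attribute
    // replication would do -- so shared references are not preserved
    // across attributes.
    static Attribute replicate(Attribute a) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(a);
        out.flush();
        return (Attribute) new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        C shared = new C();
        Attribute a = new Attribute(shared);
        Attribute b = new Attribute(shared);
        System.out.println("before: " + (a.ref == b.ref));   // true
        Attribute a2 = replicate(a);
        Attribute b2 = replicate(b);
        System.out.println("after:  " + (a2.ref == b2.ref)); // false: two copies of C
    }
}
```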

Does anybody know about any tool for automatic testing of failover (or at least just session replication) for WebLogic and/or WebSphere?

Thanks,
Viktor