Multi-Touch Developer Journal

Carrier Grade Linux - The Next Generation

This ain't yer daddy's telecoms industry any more

A transformation is taking place in telecommunications to meet the demands of new voice and data technologies. These technologies include Voice-over-IP (VoIP), the packet-switched alternative to old-fashioned circuit-switched telephony. To enable VoIP traffic, application servers must provide carrier-grade reliability that guarantees high service availability (99.999% uptime or better). These systems must also scale to handle hundreds of thousands of calls and provide predictable performance and high speech quality.

The telecommunications industry is undergoing enormous changes as equipment providers migrate from proprietary platform architectures to open software environments and commercial off-the-shelf (COTS) platform architectures. Open software and COTS hardware are seen as a means for rapidly deploying new voice and data services, while cutting capital expenses and operating costs, enabling equipment providers to stay competitive and profitable.

Carrier Grade Linux (CGL) stands at the center of the move to open architectures. About three years ago, a group of industry representatives from the platform vendors, Linux distribution suppliers and network equipment providers set out to define how "Carrier Grade Linux" could enable environments with higher availability, serviceability and scalability requirements. The result was the Open Source Development Labs (OSDL) CGL working group. Since its formation, the working group has produced two versions of a specification to define these capabilities. In response, Linux distribution suppliers are now demonstrating that they can meet the emerging telecoms needs by disclosing publicly how their products address the requirements defined in the Carrier Grade Linux Requirements Definition - Version 2.0.

Today, the CGL working group has grown to over three dozen representatives of platform vendors, Linux distribution suppliers, network equipment providers, carriers and development community members worldwide (see Figure 1). This expanded group is now releasing a third version of the CGL requirements in stages. A technology version of the document will be released in early 2005 and a registerable version will be released in the second half of 2005. For clarity and ease of use, the specification has been split into seven separate topical documents:

  • Availability
  • Clusters
  • Serviceability
  • Performance
  • Standards
  • Hardware
  • Security
Drafts of these documents have been available for public review for eight months so that feedback can be incorporated into the definition of Carrier Grade Linux.

As CGL capabilities become available in mainstream implementations and distributions, Linux not only becomes more attractive for telecoms applications, but the entire Linux community benefits from a highly available, scalable, high-performance and manageable Linux environment.

High-availability middleware components and service-availability middleware that run on CGL systems are addressed by organizations such as the Distributed Management Task Force (DMTF), the Object Management Group (OMG) and the Service Availability Forum (SAF). The high-availability hardware platforms underlying CGL are addressed by organizations such as the PCI Industrial Computer Manufacturers Group (PICMG) and the Intelligent Platform Management Interface (IPMI) (see Figure 2).

Availability

Telecommunication customers expect their voice and data services to always be available. System availability is dependent on the availability of individual system components. To ensure 24/7 service, it must be possible to do system maintenance and system expansion on running telecommunication networks and servers without disrupting the services they provide. Systems must be able to withstand component failure, making redundancy of power supplies, fans, network adapters, storage and storage paths essential. Software failures can also significantly impact the availability of a compute node, so robust application software, middleware and operating system software is required for single-node availability.
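
The dependence of system availability on component availability can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes component failures are independent, and the function names and the availability figures are made up for the example rather than taken from the CGL documents.

```python
from math import prod

def series_availability(components):
    """Availability of a chain in which every component must be working."""
    return prod(components)

def redundant_availability(components):
    """Availability of a redundant group that works if any one member works."""
    return 1.0 - prod(1.0 - a for a in components)

# Hypothetical figures: one power supply is up 99.9% of the time.
single_psu = 0.999
dual_psu = redundant_availability([single_psu, single_psu])

# A node needs power AND its network path AND its storage path.
node = series_availability([dual_psu, 0.9999, 0.9995])

print(f"dual power supply: {dual_psu:.6f}")
print(f"whole node:        {node:.6f}")
```

Under these assumptions the redundant pair is far more available than either supply alone, while the node as a whole is still limited by its weakest non-redundant component, which is why the requirements pay attention to every class of component.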

The CGL Availability Requirements Definition - V3.0 is a collection of requirements that addresses the robustness of a single computing node. Availability is enhanced by clustering individual computing nodes so a node can't be a single point of failure. The single-node requirements in the Availability section can be categorized as:

  • Online operations
  • Redundancy
  • Monitoring
  • Robust software

Online Operations

Online operations enable a system to provide a service while software or hardware is replaced or upgraded. For instance, when a file system needs repair, the fix may require rebooting the system. However, CGL requires that it be possible to unmount a file system forcibly so repairs and remounting can be done without rebooting. The ability to replace or upgrade disks, processors, memory, or even entire processor/memory blades without bringing down the node or the network contributes significantly to continuous service availability.

Redundancy

A highly available system must have redundant components and be able to take advantage of redundant hardware so that it can function when a component fails. Ideally, designs can eliminate all single points of failure from a system. Using redundant communication paths, such as redundant network ports and host adapters, together with network failover software capabilities such as Ethernet bonding, improves network availability. Redundant storage paths, such as redundant Fibre Channel ports and host adapters used with multipath I/O, improve storage availability. Redundant memory components may not be possible, but error detection and correction can be used to mask memory cell failures; CGL requires software Error Correction Code (ECC) support. Single-bit errors are reported when they're detected in the hardware and logged by the kernel. The kernel invokes a panic routine whenever uncorrectable multi-bit errors are spotted.
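
To show how an error-correcting code can mask a single-bit failure, here is a toy Hamming(7,4) encoder and decoder. It is a teaching sketch, not the code used by any particular memory controller: real ECC memory typically uses wider codes with an extra parity bit so that double-bit errors are detected as uncorrectable rather than miscorrected. The function names are invented for this example.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4      # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4      # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def decode(codeword):
    """Return (data_bits, corrected_position).

    corrected_position is the 1-based position of the bit that was
    repaired, or 0 when the codeword arrived intact.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # the syndrome names the bad bit
    if position:
        c[position - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], position

word = encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate a single-bit memory fault
data, fixed_at = decode(word)
print(data, fixed_at)
```

The nonzero position returned by `decode` is the kind of event that would be reported and logged as a corrected single-bit error, while the data delivered to the caller is unchanged.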

Monitoring

Rapid detection of hardware or software failures requires health monitoring. Health monitoring is also needed to catch hardware or software that's beginning to fail, using techniques such as ECC memory checking, predictive analysis for disks, and detection of processes that don't respond in a predictable way. Examples of CGL monitoring requirements include Non-Intrusive Monitoring of Processes and Memory Overcommit Actions. Non-intrusive monitoring of processes detects abnormal behavior by a process, such as process death, and initiates an action, such as the creation of a new process. Memory Overcommit Actions monitor system memory use and control process activity when memory use exceeds specified thresholds.
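
The idea of detecting process death and creating a replacement process can be sketched in a few lines. This is a simplified user-space illustration, not the mechanism the CGL requirement mandates; the `supervise` function and its `max_restarts` parameter are hypothetical. The monitored program itself is not modified, which is the sense in which the monitoring is non-intrusive.

```python
import subprocess
import sys

def supervise(command, max_restarts=3):
    """Run command, starting a fresh copy each time it exits abnormally.

    Returns the number of restarts that were needed before a clean exit.
    """
    restarts = 0
    while True:
        child = subprocess.Popen(command)
        exit_code = child.wait()          # blocks until the process ends
        if exit_code == 0:
            return restarts               # clean exit, nothing to recover
        if restarts >= max_restarts:
            raise RuntimeError(f"gave up after {restarts} restarts")
        restarts += 1                     # abnormal exit: create a new process

if __name__ == "__main__":
    # A child that exits cleanly needs no restarts.
    print(supervise([sys.executable, "-c", "raise SystemExit(0)"]))
```

A production monitor would also need to handle hung processes that never exit, which is one reason the requirements pair process monitoring with other checks.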

Robust Software

Robust software implies not only high quality levels for operating system software, middleware and applications, but also capabilities for maintaining and upgrading software without bringing the system down. In many cases, continuous service availability can be maintained. Live Patching enables process modification without process termination. Excessive CPU Cycle Detection finds abnormal process behavior by setting CPU use thresholds at various points in the process, catching problems such as infinite loops or thrashing, and initiates actions such as restarting the process.
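
One simple way to picture a CPU use threshold is to compare two samples of a process's cumulative CPU time. The sketch below is an assumption about how such a check could look, not the mechanism defined in the requirement; `CpuSample` and `exceeds_cpu_threshold` are names invented for the example.

```python
from dataclasses import dataclass

@dataclass
class CpuSample:
    wall_seconds: float   # wall-clock time at which the sample was taken
    cpu_seconds: float    # cumulative CPU time the process had used by then

def exceeds_cpu_threshold(earlier, later, max_utilization):
    """True when the process used more than max_utilization of one CPU
    between the two samples (1.0 means a full CPU for the whole interval)."""
    elapsed = later.wall_seconds - earlier.wall_seconds
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    utilization = (later.cpu_seconds - earlier.cpu_seconds) / elapsed
    return utilization > max_utilization

# A process that burned 9.8 s of CPU in a 10 s window looks like a busy loop.
spinning = exceeds_cpu_threshold(CpuSample(0.0, 1.0), CpuSample(10.0, 10.8), 0.9)
print(spinning)
```

A monitor built on this check could then take the recovery action the text describes, such as restarting the offending process.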

To maximize system uptime, it's important to minimize the time a system is in an off-line state, such as shutting down or booting. The Fast System Startup requirement addresses the time it takes to reboot a system by specifying the ability to bypass the BIOS firmware. The Boot Image Fallback requirement defines a mechanism that enables a system to fall back to a previously good boot image in the event of a catastrophic boot failure.
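
The fallback logic can be summarized as "try the preferred image, then retreat to older known-good ones". The following is a schematic model only: in a real system this decision lives in the boot loader or firmware, and `select_boot_image`, `try_boot` and the image names are hypothetical.

```python
def select_boot_image(images, try_boot):
    """Return the first image that boots.

    images   -- image names, most preferred (newest) first
    try_boot -- callable taking an image name and returning True on success
    """
    for image in images:
        if try_boot(image):
            return image
    raise RuntimeError("no bootable image available")

# Pretend the newest image is corrupt and the previous one is still good.
broken = {"kernel-new"}
chosen = select_boot_image(
    ["kernel-new", "kernel-previous"],
    lambda image: image not in broken,
)
print(chosen)
```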

The requirements defined in the CGL Availability Requirements Definition - Version 3 are common functions on proprietary carrier-grade systems and address the gap between Linux, which was developed with a desktop focus, and carrier-grade systems where service availability is critically important.

Clusters

The CGL working group conducted a clusters usage model study from which it learned that no single clustering model meets the needs of all carrier applications. So CGL takes a more general approach to defining clustering requirements. It defines the functional components of a carrier-grade high-availability cluster (HAC). The requirements for other cluster models, such as a scalability cluster, a server consolidation cluster and a high-performance computing (HPC) cluster, have been treated as secondary to the requirements of the HAC model (see Figure 3).

A high-availability CGL cluster is characterized by two or more computing nodes between which an application or workload can migrate depending on a policy-based failover mechanism. Essentially, the cluster nodes can "cover" for each other. Carrier-grade services must maintain an uptime of "five nines" (99.999%) or better and, quite often, a failing service must restart in sub-second time frames to maintain continuous operation.
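
It helps to translate an availability percentage into a downtime budget. The small calculation below assumes a year of 365.25 days; the function name is made up for the example.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

def allowed_downtime_minutes(availability):
    """Yearly downtime budget, in minutes, for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, availability in [
    ("three nines", 0.999),
    ("four nines", 0.9999),
    ("five nines", 0.99999),
]:
    minutes = allowed_downtime_minutes(availability)
    print(f"{label}: {minutes:.2f} minutes per year")
```

At 99.999% the budget is a little over five minutes a year, which is why a service that takes even tens of seconds to restart can use up a large share of it in a handful of failures.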

A loosely coupled cluster model with no shared storage is a basic clustering technique suitable for many types of telecom application servers. This model eliminates the possibility of a failed shared component affecting the availability of the service or the availability of the system.

Whether shared storage is implied or not, a cluster provides the following advantages:

  • It prevents a node from being a single point of failure. With hardware faults, the failing node can be replaced or repaired without affecting the service uptime (i.e., no unscheduled downtime)
  • It allows a software or kernel upgrade to be done on each node separately without affecting the availability of the service
  • It isolates failing nodes from the cluster and enables the service to continue using the remaining healthy nodes
  • It allows hardware upgrades on each node separately without affecting service availability
  • It enables increased capacity to meet load/traffic increases
The functional requirements of CGL clustering include support for redundancy (no single point of failure), not only at the cluster node level, but at the hardware level as well, including fans, power supplies, memory ECC, communication paths and storage paths. To support continuous operation of carrier-grade services, requirements are defined for node failure detection and various forms of service failover, such as application, node address and connections failovers.
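
Node failure detection followed by service failover is often built on heartbeats: a node that stays silent for longer than a timeout is treated as failed, and the service moves to the next healthy node. The sketch below illustrates that general pattern under simplifying assumptions (timestamps are passed in explicitly, there is no network code, and split-brain handling is ignored). The class and function names are hypothetical and are not part of the CGL or SA Forum specifications.

```python
class HeartbeatMonitor:
    """Tracks the last time each cluster node was heard from."""

    def __init__(self, nodes, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {node: 0.0 for node in nodes}

    def heartbeat(self, node, now):
        """Record that a heartbeat from node arrived at time now."""
        self.last_seen[node] = now

    def healthy_nodes(self, now):
        """Nodes whose most recent heartbeat is within the timeout."""
        return [n for n, seen in self.last_seen.items()
                if now - seen <= self.timeout]

def place_service(monitor, preferred_order, now):
    """Pick the first healthy node in policy order, or None if all are down."""
    healthy = set(monitor.healthy_nodes(now))
    for node in preferred_order:
        if node in healthy:
            return node
    return None

monitor = HeartbeatMonitor(["node-a", "node-b"], timeout_seconds=0.5)
monitor.heartbeat("node-a", now=1.0)
monitor.heartbeat("node-b", now=1.0)
print(place_service(monitor, ["node-a", "node-b"], now=1.2))

monitor.heartbeat("node-b", now=1.6)   # node-a has gone silent
print(place_service(monitor, ["node-a", "node-b"], now=1.8))
```

The timeout is the main design trade-off: a short value supports the sub-second restart times carrier services need, but makes false failure detections more likely on a congested link.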

The CGL clustering requirements are framed around industry standard programming interfaces. The Service Availability Forum (SA Forum) has developed an Application Interface Specification (AIS) that defines the service interfaces for clustered applications. The specification is OS-independent and is being used in both proprietary and open source cluster developments. The SA Forum AIS specifies a membership service API, a checkpoint service API, an event service API, a message service API, and a lock service API. AIS also specifies an availability management framework (AMF) that covers resource management and application failover policy in the cluster.

Serviceability

The CGL Serviceability Requirements Definition - Version 3.0 specifies a set of useful and necessary features for servicing and maintaining a system. Telecommunication systems such as management servers, signaling servers and gateways must be managed and monitored remotely, have robust software package management for installations and upgrades, and provide mechanisms for capturing and analyzing failure information. A single point of control is required for applications, software, hardware and data for functions such as data movement, security, backup and recovery.

CGL systems will support remote management standards such as Simple Network Management Protocol (SNMP), Common Information Model (CIM) and Web-Based Enterprise Management (WBEM). Local management standards include IPMI and the Service Availability Forum's Hardware Platform Interface (HPI).

Debuggers, application and kernel dumpers, watchdog triggers and error analysis tools are needed to debug and isolate failures in a system. Diagnostic monitoring of temperature controls, fans, power supplies, storage media, the network, CPUs and memory is needed for quick failure detection and failure diagnosis.

Performance

The CGL Performance Requirements Definition - V3.0 is a collection of requirements for the Linux operating system that describes the performance and scalability needs of typical communications systems. Key points include a system's ability to meet service deadlines, scale to take advantage of symmetric multiprocessing (SMP), hyper-threading technology and large memory systems, and provide efficient, low-latency communication.

Without predictable scheduling latencies, service deadlines might not be met, resulting in dropped calls, unreasonable call-response characteristics, or even the entire service dropping out of active operation. Soft real-time scheduling provides predictable scheduling latencies under defined loads. Latency and scheduling parameters have to be configurable at runtime; the scheduling quantum must be configurable to 1ms or less. Protection against priority inversion is also required to maintain predictable scheduling.
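
Scheduling latency can be observed from user space by asking to sleep for a fixed period and measuring how late the wake-up arrives. The sketch below is a rough measurement aid, not a conformance test for the requirement; the function name and default values are chosen for illustration, and an interpreted language adds overhead of its own to the figures.

```python
import time

def measure_wakeup_latency(period_seconds=0.001, samples=200):
    """Sleep repeatedly for period_seconds and return, for each sleep,
    how much longer than requested it took to wake up (in seconds)."""
    overshoots = []
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(period_seconds)
        overshoots.append(time.monotonic() - start - period_seconds)
    return overshoots

if __name__ == "__main__":
    results = measure_wakeup_latency()
    print(f"worst wake-up delay: {max(results) * 1e6:.0f} microseconds")
    print(f"mean wake-up delay:  {sum(results) / len(results) * 1e6:.0f} microseconds")
```

Running it on an idle system and again under heavy load gives a feel for how far the worst case can drift from the mean. On Linux, Python also exposes `os.sched_setscheduler` with the `os.SCHED_FIFO` policy, which normally requires elevated privileges; repeating the measurement under a real-time policy is one way to see what soft real-time scheduling is intended to buy.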

To take advantage of scalable hardware architectures, CGL specifies support for SMP and hyper-threading technologies that includes process affinity and interrupt affinity capabilities. Large memory systems of more than 4GB of physical memory are needed to handle the memory demands of scalable communication applications.
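
Process affinity means restricting a process to a chosen set of CPUs. On Linux, Python's standard library wraps the relevant system calls as `os.sched_getaffinity` and `os.sched_setaffinity`. The helper below is a minimal sketch of their use; `pin_to_one_cpu` is a name invented for the example, and interrupt affinity is configured separately and is not shown.

```python
import os

def pin_to_one_cpu():
    """Pin the calling process to one CPU it is already allowed to run on.

    Returns the chosen CPU number, or None on platforms without the call.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None                      # not available outside Linux
    allowed = os.sched_getaffinity(0)    # 0 means "the calling process"
    cpu = min(allowed)
    os.sched_setaffinity(0, {cpu})
    return cpu

if __name__ == "__main__":
    print("pinned to CPU", pin_to_one_cpu())
```

Pinning a latency-sensitive process this way keeps its working set in one CPU's cache and keeps it away from CPUs reserved for other work, at the cost of giving up the scheduler's freedom to balance load.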

Protocol stacks have to be prioritized so certain protocols can take scheduling priority over less important network protocols. To improve latency and reduce CPU usage in network communications, zero-copy network protocols may be needed. IPv6 forwarding tables have to be compact and use only a small amount of memory. Support in the Linux kernel for a 9000-byte maximum transfer unit (MTU) is required.
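
The value of a 9000-byte MTU shows up in per-frame overhead. The calculation below assumes TCP over IPv4 without options on standard Ethernet, counting preamble, header, frame check sequence and inter-frame gap as wire overhead; the constant and function names are chosen for this example.

```python
ETHERNET_OVERHEAD = 38   # preamble 8 + header 14 + FCS 4 + inter-frame gap 12
IP_TCP_HEADERS = 40      # IPv4 header 20 + TCP header 20, no options

def payload_efficiency(mtu):
    """Fraction of bytes on the wire that carry application payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETHERNET_OVERHEAD)

def frames_needed(total_bytes, mtu):
    """Number of full-size frames required to carry total_bytes of payload."""
    per_frame = mtu - IP_TCP_HEADERS
    return -(-total_bytes // per_frame)   # ceiling division

for mtu in (1500, 9000):
    print(f"MTU {mtu}: {payload_efficiency(mtu):.2%} payload, "
          f"{frames_needed(1_000_000, mtu)} frames per megabyte")
```

The gain in wire efficiency is a few percentage points, but the roughly sixfold drop in frame count matters more for CPU usage, since much of the per-packet cost (interrupts, header processing) is independent of frame size.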

Standards

One goal of the CGL effort is to achieve high reliability, availability and serviceability (RAS) as well as application portability. To that end, the working group leverages mature and well-established industry standards that are common and relevant to the carrier-grade environment and includes them as part of the CGL requirements.

Open standards are important because they are freely available to anyone or any organization to use and because open standards can evolve with wide community feedback and validation. The CGL working group is actively involved with recognized standards bodies, such as the Linux Standard Base (LSB), a Free Standards Group initiative, and the Service Availability Forum (SA Forum). These organizations produce standards and specifications that address the RAS and application portability gaps between Linux as it exists today and where it needs to go to support highly available communications applications.

The first precept in the CGL Standards Definition - Version 3.0 voices the CGL working group's desire to work alongside recognized standards bodies:

CGL 3.0 specifies the need for compliance with the Linux Standard Base (LSB) version 2.0.1 to ensure that a CGL 3.0 distribution supports the same level of application binary compatibility as is required by the LSB standard.

CGL 3.0 also requires implementation of the latest interface specifications from the SA Forum to provide a common set of standards and building blocks for high-availability architectures and platform management. The SA Forum provides standards specifications that define interfaces for cluster-aware applications (Application Interface Specification - AIS version B.01.01) and platform management applications (Hardware Platform Interface - HPI version B.01.01). See www.saforum.org.

Expanding on previous cuts of the CGL specifications, the CGL Standards Definition - Version 3.0 adds more POSIX compliance requirements based on IEEE 1003.1-2001. This added POSIX compliance is intended to bridge the application portability gaps as mainstream communications applications are ported to the Linux environment.

Various other standards requirements are included in the CGL Standards Definition - Version 3.0 to address the networking, communications and platform needs of carrier environments. Standards requirements such as Stream Control Transmission Protocol (SCTP), Internet Protocols (IPv4/IPv6), Mobile IPv6 (MIPv6), Simple Network Management Protocol (SNMP), Intelligent Platform Management Interface (IPMI), IEEE 802.1Q (virtual LAN), Diameter, Common Information Model (CIM), Web-Based Enterprise Management (WBEM), Advanced Configuration and Power Interface (ACPI), PCI Express and Trusted Platform Module (TPM) are included.

More open industry standards will mature and be recognized over time. The CGL working group will evaluate them for future versions of the CGL requirements. The group believes that adopting open standards in mainline Linux offerings will benefit application developers and solution providers and will carry Linux to the next level of popularity in the communications industry as well as the general Linux user community.

Hardware

To stay competitive and profitable in the telecommunication industry, standards-based, modular, COTS hardware is being used along with open software, including operating systems, middleware and applications. A goal of the CGL working group is to promote the migration of the telecommunication industry from proprietary hardware to COTS hardware by ensuring that the Linux environment adequately supports these COTS platforms. The CGL Hardware Requirements Definition - Version 3.0 identifies a set of widely used platforms and defines the support that's needed for them in the operating system. The scope of these requirements applies to the Linux kernel, kernel interfaces (APIs and libraries), system software and tools.

The CGL Hardware Requirements Definition - Version 3.0 specifies a set of generic requirements that are common across platform types. It includes support for blade servers, hardware management interfaces and blade hot-swap events. To address the need to manage highly available carrier-grade systems through hardware out-of-band mechanisms, management capabilities such as those found in the Intelligent Platform Management Interface (IPMI) are also described.

Carrier-grade systems require high-performance, high-throughput interconnections in a system and between system nodes. Hardware-related requirements, such as PCI Express support, Message Signaled Interrupts (MSI) and PCI Express Device Hot Plug, are included. Other hardware-related requirements such as a CPU throttle mechanism, a "suspend to disk and resume" capability, trusted platform module (TPM) support and boot-loader integrity check are also specified.

Considering the diverse hardware used in carrier-grade environments, the CGL Hardware Requirements Definition - Version 3.0 doesn't define the requirements for just one type of platform. Instead it defines generic platform requirements and then provides an "Industry Platforms" section of implementation guidelines for specific architectures. Examples of such platforms include AdvancedTCA, BladeCenter, CompactPCI and rack-mount servers.

Security

The CGL Security Requirements Definition - Version 3.0 wasn't released with the other version 3 CGL specifications in February 2005. It's due later in 2005 when the CGL security profile definition has stabilized and an approach to security in a telecommunications environment has been defined.

The telecommunications environment is very different from a general-purpose computing environment. The most salient differences to consider in developing a CGL threat model are:

  • CGL systems don't have many user accounts, and their accounts don't reflect individual users.
  • CGL systems are configured through custom user interfaces, not through shell access, which is often unavailable.
One of the assumptions the CGL working group makes about the environment is that administrators are trusted and competent, which eases the security burden on the underlying Linux platform.

The major threat to the telecommunications environment is, therefore, unauthorized access to management and control interfaces by outsiders. These outsiders can gain access by subverting the operating system or one of the applications it's running.

A potentially severe security threat arises when applications need to touch multiple security planes. Many telecommunication services can be provisioned remotely by the end user. Many ISPs that offer domain hosting allow customers to create new mailboxes, or to route incoming calls for five-digit work extensions to any phone number in the world, with just a few clicks on a web page. Facilities like these create a new set of risks:

  • Unauthorized re-routing of e-mail and phone calls by disgruntled associates or unscrupulous competitors.
  • Exploitation of software vulnerabilities to "jump" from one security plane to another, which can lead to many kinds of risks.
Mitigating these risks will require forethought so that users are properly authenticated and authorized, and so that information traveling between planes passes through narrowly defined interfaces that protect against unauthorized access.

An OSDL special interest group for security (the security SIG) is generating a CGL security profile and architecture approach for security on communications servers.

Summary

CGL 3.0 specifications are an upward-compatible superset of the CGL 2.0.2 specification. In 2003 and 2004, member companies produced communications products based on the 1.1 CGL specs. In the latter half of 2004, Linux distributors began to announce Linux offerings based on the CGL 2.0.2 specification. It's expected that products based on CGL 2.0.2 will begin to appear this year. A smooth transition is expected for carriers and equipment providers as Linux distributions incorporate CGL 3.0 capabilities in 2005 and 2006.

Development is underway on many of the CGL capabilities that don't yet appear in mainline distributions. While CGL requirements are specified for Linux platforms in the communications industry, a high-availability, high-performance, scalable system is viewed as beneficial to the entire Linux community.

Links and References

  • OSDL Corporate Site: www.osdl.org/
  • Carrier Grade Linux home page: www.osdl.org/lab_activities/carrier_grade_linux/
  • Carrier Grade Linux Glossary: www.osdl.org/lab_activities/carrier_grade_linux/glossary.html/document_view
  • Carrier Grade Linux Datasheet: www.osdl.org/docs/cgl_datasheet.pdf
  • CGL 3.0 Draft Documents: http://developer.osdl.org/cherry/cgldrafts/index.html
  • CGL Documents: www.osdl.org/lab_activities/carrier_grade_linux/documents.html/document_view
  • CGL Registration: www.osdl.org/lab_activities/carrier_grade_linux/registration.html/document_view
  • OSDL Developer Resources: http://developer.osdl.org/
  • Storage SIG: http://developer.osdl.org/dev/storage/
  • Hotplug SIG: http://developer.osdl.org/dev/hotplug/
  • Security SIG: http://developer.osdl.org/dev/security/
  • Binary Testing SIG: http://developer.osdl.org/dev/brt/
  • Clusters SIG: http://developer.osdl.org/dev/clusters

    Carrier Grade Linux Specification Editors

  • Availability: Takashi Ikebe (NTT) - ikebe.takashi@lab.ntt.co.jp
  • Serviceability: Joel Krauska (Cisco) - jkrauska@cisco.com
  • Performance: Eric Chacron (Alcatel) - Eric.Chacron@alcatel.fr
  • Standards: Terence Chen (Intel) - terence.chen@intel.com
  • Clusters: John Cherry (OSDL) - cherry@osdl.org
  • Hardware: Bing Wei Liu (Intel) - bing.wei.liu@intel.com
  • Security: Ge' Weijers (Sun) - Ge.Weijers@Sun.COM
    John Cherry is the roadmap coordinator for the Carrier Grade Linux initiative at OSDL. He has managed kernel developers locally and has
    coordinated initiative activities around the world for over two years at OSDL. John has over 20 years experience in enterprise computing and
    system software at companies such as Floating Point Systems, Sequent Computer Systems, and IBM. John holds a Bachelor of Science in Electronic Engineering Technology from DeVry Institute of Technology.

    Takashi Ikebe is a senior open source development engineer with NTT Network Service Systems Laboratories. Within CGL, he participates in the Specifications Group.

    Terence Chen is a senior software engineering manager with the Open Source Technology Center at Intel Corporation. Terence has actively participated and contributed to the OSDL Carrier Grade Linux (CGL) effort since 2001. Within CGL, he participates in the Specifications Group and Technical Board.

    Steven Dake has been involved in open source Linux for over 10 years. He was a key contributor to the team that developed the industry's first Carrier Grade Linux implementation at MontaVista. Within CGL, Steve is an active member of the technical working group where he participates on the Specifications Group and Proof of Concept Group.