Revision as of 06:03, 28 March 2012

Intro

How can we design an architecture that achieves the desired quality attributes?
Sources of architecture
- Theft: From previous systems, literature
- Method: Systematic and conscious, derived from requirements via transformations and heuristics.
- Intuition: Ability to conceive without conscious reasoning. Greater reliance on intuition increases risk.
The ratio in which these three sources are used varies with the architect's experience and the novelty of the system.

What is a tactic? A tactic is a design decision that influences the control of a quality attribute response.
A collection of tactics is an architectural strategy.
Each tactic is a design option for the architect.

All approaches to maintaining availability involve some type of redundancy, some type of health monitoring and some type of recovery when a failure is detected.
Availability tactics fall into three groups: fault detection, fault recovery and fault prevention.

Ping/echo and heartbeat generally operate among distinct processes, whereas the exception tactic operates within a single process.

Ping/echo
One component issues a ping to the component being checked and expects to receive an echo back within a predefined time.
The response time also allows performance to be assessed.
If the bandwidth consumed by pings is an issue, the ping/echo detectors can be organized in a hierarchy.
- Low-level detectors ping the low-level processes, and higher-level fault detectors ping the lower-level detectors.
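
The ping/echo tactic and its hierarchical variant can be sketched as follows. This is a minimal illustration, not a production detector: components are modelled as plain callables that return the string "echo", and the names `ping` and `make_detector` are hypothetical. The timeout is checked after the call returns, so a component that never answers would still block the caller.

```python
import time
from typing import Callable, Iterable, Optional


def ping(echo: Callable[[], str], timeout_s: float) -> Optional[float]:
    """Return the round-trip time in seconds, or None if the check fails."""
    start = time.monotonic()
    try:
        reply = echo()
    except Exception:
        return None  # the component could not answer at all
    elapsed = time.monotonic() - start
    # Assumption: a late echo counts as a failure, as does a wrong reply.
    if reply != "echo" or elapsed > timeout_s:
        return None
    return elapsed


def make_detector(children: Iterable[Callable[[], str]],
                  timeout_s: float) -> Callable[[], str]:
    """Build a low-level detector that echoes only if all its children do."""
    children = list(children)

    def echo() -> str:
        ok = all(ping(child, timeout_s) is not None for child in children)
        return "echo" if ok else "fault"

    return echo
```

A higher-level detector then pings only the object returned by `make_detector` instead of every process, which is where the bandwidth saving comes from. The measured round-trip time returned by `ping` is the value that can be used to assess performance.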

Heartbeat
One component emits a heartbeat message periodically and another component listens for it.
If the heartbeat is absent, the originating component is assumed to have failed.
Heartbeat messages can also carry useful data.
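
A listener for the heartbeat tactic might look like the sketch below. The class name `HeartbeatMonitor`, the `max_silence_s` threshold and the explicit `now` argument are assumptions made for illustration; passing the time in explicitly keeps the example deterministic.

```python
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each source."""

    def __init__(self, max_silence_s: float) -> None:
        self.max_silence_s = max_silence_s
        self.last_seen: Dict[str, float] = {}
        self.payloads: Dict[str, dict] = {}

    def beat(self, source: str, now: float,
             payload: Optional[dict] = None) -> None:
        """Record a heartbeat; the message may piggyback useful data."""
        self.last_seen[source] = now
        if payload is not None:
            self.payloads[source] = payload

    def suspected_failed(self, now: float) -> List[str]:
        """Return the sources that have been silent for too long."""
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.max_silence_s
        )
```

Unlike ping/echo, the monitored component initiates the message here, so the listener only has to notice silence.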

Exceptions
An exception is raised when a fault is encountered during execution.
The exception handler is then invoked; it typically executes in the same process that raised the exception.
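
As a small illustration of the exception tactic, the handler below runs in the same process as the code that raised the exception. The functions `read_sensor` and `safe_read`, and the idea of substituting a fallback value, are hypothetical choices for this sketch.

```python
def read_sensor(raw: str) -> float:
    """Parse a reading; float() raises ValueError on malformed input."""
    return float(raw)


def safe_read(raw: str, fallback: float) -> float:
    """Detect the fault via the exception and handle it in-process."""
    try:
        return read_sensor(raw)
    except ValueError:
        # Fault detected: the handler executes here, in the same process.
        return fallback
```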

Fault recovery consists of preparing for recovery and making the actual system repair.

Voting
Processes running on redundant processors each take equivalent input and compute a simple output value that is sent to a voter.
If the voter detects deviant behaviour from a single processor, it fails that processor.
The voting algorithm can be, for example, "majority wins" or "preferred component".
Often used in control systems to correct the faulty operation of algorithms or processors.
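
A "majority wins" voter can be sketched as below. The function name `majority_vote` and the choice to raise an error when no strict majority exists are assumptions; a real voter might instead fall back to a preferred component.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def majority_vote(outputs: Dict[str, Hashable]) -> Tuple[Hashable, List[str]]:
    """Return the majority value and the processors that deviated from it."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, votes = Counter(outputs.values()).most_common(1)[0]
    if votes <= len(outputs) / 2:
        raise ValueError("no strict majority")
    # Processors whose output differs from the majority are marked as failed.
    deviants = sorted(name for name, out in outputs.items() if out != value)
    return value, deviants
```

For example, `majority_vote({"p1": 42, "p2": 42, "p3": 41})` returns `(42, ["p3"])`, so the caller can use 42 and fail processor p3.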

Active Redundancy (Hot restart)
There are N redundant components, all of which respond to events in parallel.
The response from only one component is used; the rest are discarded.
Downtime is minimal, because the backups are current and the time to recover is only the switching time.
E.g. a LAN with a number of parallel paths, with each redundant component placed on a separate path.
Synchronization is achieved by ensuring that every message sent to any component is sent to all redundant components; a reliable transmission protocol may therefore be required.
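
The following sketch illustrates active redundancy with an in-memory stand-in for message delivery. `Replica` and `ActiveGroup` are hypothetical names, the replicated state is a simple running total, and delivery is assumed to be reliable, which a real system would need a transmission protocol to guarantee.

```python
from typing import Iterable, List


class Replica:
    """One redundant component holding a running total as its state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.total = 0
        self.alive = True

    def handle(self, amount: int) -> int:
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.total += amount
        return self.total


class ActiveGroup:
    """Sends every event to all replicas and uses one response."""

    def __init__(self, replicas: Iterable[Replica]) -> None:
        self.replicas: List[Replica] = list(replicas)

    def send(self, amount: int) -> int:
        results = []
        for replica in self.replicas:
            try:
                results.append(replica.handle(amount))
            except RuntimeError:
                continue  # a failed replica is simply skipped
        if not results:
            raise RuntimeError("all replicas failed")
        return results[0]  # the remaining responses are discarded
```

Because every live replica has processed every event, losing one costs only the switch to the next response.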

Passive Redundancy (Warm restart)
One component (the primary) responds to events and informs the other components (the standbys) of status updates.
When a fault occurs, the system must ensure that the backup state on the standby is sufficiently fresh before resuming services.
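
Passive redundancy can be illustrated with the sketch below, in which the primary forwards each state change to its standbys. The names `Primary`, `Standby` and `promote` are hypothetical, and updates are assumed to arrive synchronously, so the standby is always fresh; with asynchronous updates a freshness check would be needed before promotion.

```python
from typing import Dict, List


class Standby:
    """Holds a copy of the primary's state but does not serve events."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}

    def apply_update(self, key: str, value: int) -> None:
        self.state[key] = value


class Primary:
    """Serves events and informs the standbys of each state update."""

    def __init__(self, standbys: List[Standby]) -> None:
        self.state: Dict[str, int] = {}
        self.standbys = standbys

    def handle(self, key: str, value: int) -> None:
        self.state[key] = value
        for standby in self.standbys:
            standby.apply_update(key, value)


def promote(standby: Standby) -> Primary:
    """Turn a standby into the new primary after the old one fails."""
    new_primary = Primary([])
    new_primary.state = dict(standby.state)
    return new_primary
```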

Spare
A standby spare platform is kept ready to replace a failed component.
When a failure occurs, it must be rebooted to the appropriate software configuration and its state initialized to the point at which the failure occurred.
Checkpoints of the system state must therefore be made regularly.
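
One way to bring a spare up to the point of failure is to combine periodic checkpoints with a log of the events handled since the last checkpoint. The sketch below assumes that approach; the event log, the names `CheckpointedCounter` and `restore_on_spare`, and the `every` parameter are illustrative and not taken from the notes above.

```python
import copy
from typing import Dict, List


class CheckpointedCounter:
    """A component that checkpoints its state every `every` events."""

    def __init__(self, every: int) -> None:
        self.every = every
        self.state: Dict[str, int] = {"count": 0}
        self.checkpoint: Dict[str, int] = copy.deepcopy(self.state)
        self.log: List[int] = []  # events since the last checkpoint

    def handle(self, amount: int) -> None:
        self.state["count"] += amount
        self.log.append(amount)
        if len(self.log) >= self.every:
            self.checkpoint = copy.deepcopy(self.state)
            self.log.clear()


def restore_on_spare(checkpoint: Dict[str, int],
                     log: List[int]) -> Dict[str, int]:
    """Rebuild the failed component's state on a freshly booted spare."""
    state = copy.deepcopy(checkpoint)
    for amount in log:
        state["count"] += amount  # replay events after the checkpoint
    return state
```

The more often checkpoints are taken, the shorter the log to replay, at the cost of more checkpointing work during normal operation.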
