Difference between revisions of "Architecture Tactics"

From Suhrid.net Wiki

Revision as of 06:19, 28 March 2012

Intro

  • How can we design an architecture that will achieve the desired quality attributes?
  • Sources of architecture
    • Theft: From previous systems, literature
    • Method: Systematic and conscious, derived from requirements via transformations and heuristics.
    • Intuition: Ability to conceive without conscious reasoning. Increased reliance on intuition increases the risk.
  • The ratio in which these three sources are used varies with the architect's experience and the novelty of the system.
  • What is a tactic? A tactic is a design decision that influences the control of a quality attribute response.
  • A collection of tactics is an architectural strategy.
  • Each tactic is a design option for the architect.

Availability Tactics

  • All approaches to maintaining availability involve some type of redundancy, some type of health monitoring and some type of recovery when a failure is detected.
  • Availability tactics involve fault detection, fault recovery and fault prevention.

Fault Detection

  • Ping/echo and heartbeat generally operate among distinct processes, whereas the exception tactic operates within a single process.

Ping/Echo

  • One component issues a ping to a component to be checked and expects to receive back an echo within a predefined time.
  • Response time allows performance to be assessed.
  • If bandwidth consumption of pings is an issue, then the ping/echo detectors can be organized in a hierarchy.
    • Low-level detectors ping the low-level processes, and higher-level fault detectors ping the lower-level detectors.
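
The ping/echo check above can be sketched roughly as follows. This is only an illustration: `ping` and the idea of modelling the checked component as a callable that returns the string "echo" are assumptions made for the example, not part of the original notes.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def ping(component, timeout_s):
    """Ping a component and wait at most timeout_s seconds for its echo.

    `component` is a hypothetical callable standing in for a remote process;
    it is expected to return the string "echo".
    Returns (alive, response_time_s).
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(component)
    try:
        alive = future.result(timeout=timeout_s) == "echo"
    except FutureTimeout:
        alive = False  # no echo within the predefined time
    except Exception:
        alive = False  # the component raised instead of echoing
    finally:
        pool.shutdown(wait=False)  # do not block on a slow component
    return alive, time.monotonic() - start
```

The second element of the returned tuple is the measured response time, which is what allows performance to be assessed alongside liveness.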

Heartbeat

  • One component emits a heartbeat message periodically and another component listens for it.
  • Absence of heartbeat means originating component has failed.
  • Heartbeat messages can be combined with useful data.
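
A minimal sketch of the listening side of the heartbeat tactic is shown below. The class name `HeartbeatMonitor`, the `max_silence_s` threshold and the explicit `now` timestamps are assumptions chosen to keep the example deterministic; a real monitor would read a clock.

```python
class HeartbeatMonitor:
    """Tracks when each source was last heard from (hypothetical helper)."""

    def __init__(self, max_silence_s):
        self.max_silence_s = max_silence_s
        self.last_seen = {}  # source name -> timestamp of last heartbeat
        self.payloads = {}   # source name -> data carried by last heartbeat

    def on_heartbeat(self, source, now, payload=None):
        # A heartbeat can carry useful data in addition to signalling liveness.
        self.last_seen[source] = now
        self.payloads[source] = payload

    def failed(self, now):
        # A source that has been silent for too long is presumed failed.
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.max_silence_s
        )
```

For example, if sources "a" and "b" both report at time 0 and only "a" reports again at time 4, then with `max_silence_s=3.0` a call to `failed(5.0)` reports "b".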

Exceptions

  • A fault can be recognized when an exception is raised during execution.
  • An exception handler is then invoked, which typically executes in the same process that introduced the exception.

Fault Recovery

  • Fault recovery consists of preparing for recovery and making the actual system repair, as well as reintroducing components after repair.

Preparation and Repair Tactics

Voting

  • Processes running on redundant processors each take equivalent input and compute a simple output value that is sent to a voter.
  • If the voter detects deviant behaviour from a single processor, it fails that processor.
  • Different choices of voting algorithm - "majority wins" or "preferred component".
  • Often used in control systems to correct faulty algorithms or processors.
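
As an illustration of the "majority wins" choice, a voter might look roughly like the sketch below. The function name `majority_vote` and the dictionary input format are assumptions for the example; the "preferred component" algorithm is not shown.

```python
from collections import Counter


def majority_vote(outputs):
    """Pick the majority output and list the processors that deviated.

    `outputs` maps a processor name to the simple value it computed.
    Raises ValueError when no value is held by a strict majority.
    """
    counts = Counter(outputs.values())
    value, votes = counts.most_common(1)[0]
    if votes <= len(outputs) / 2:
        raise ValueError("no majority among processor outputs")
    deviants = sorted(name for name, out in outputs.items() if out != value)
    return value, deviants
```

With outputs `{"p1": 42, "p2": 42, "p3": 41}` the voter returns 42 and flags "p3" as the processor to fail.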

Active Redundancy (Hot restart)

  • There are N redundant components - all of which respond to events in parallel.
  • The response from only one component is used, though, and the rest are discarded.
  • Downtime is minimal, because backups are current and time to recover is only the switching time.
  • For example, a LAN with a number of parallel paths, with each redundant component placed on a separate path.
  • Synchronization is done by ensuring that all messages to any component are sent to all redundant components; a reliable transmission protocol may therefore be required.
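
The behaviour described above can be sketched as follows. `Replica` and `dispatch` are hypothetical names, and the running total is just a stand-in for whatever state the redundant components keep; the events are processed sequentially here rather than truly in parallel.

```python
class Replica:
    """A redundant component that keeps a running total (illustrative state)."""

    def __init__(self, name):
        self.name = name
        self.total = 0
        self.healthy = True

    def handle(self, amount):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.total += amount
        return self.total


def dispatch(replicas, amount):
    """Send the event to every replica; use the first response, drop the rest."""
    responses = []
    for replica in replicas:
        try:
            responses.append(replica.handle(amount))
        except RuntimeError:
            continue  # a failed replica is skipped; the others stay current
    if not responses:
        raise RuntimeError("all replicas failed")
    return responses[0]
```

Because every healthy replica sees every event, the surviving replica already holds the current state when another one fails, so recovery amounts to switching which response is used.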

Passive Redundancy (Warm restart)

  • One component (the primary) responds to events and informs the other components (the standbys) of status updates.
  • When a fault occurs, the system must first ensure that the backup state on the standby is sufficiently fresh before resuming services.
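
A rough sketch of the primary/standby arrangement follows. The names `Node`, `PassiveGroup` and `fail_over`, and the use of a version counter to judge freshness, are assumptions made for this example.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.version = 0  # number of updates this node has seen


class PassiveGroup:
    """One primary handles events and pushes state updates to the standbys."""

    def __init__(self, primary, standbys):
        self.primary = primary
        self.standbys = list(standbys)

    def handle(self, key, value):
        self.primary.state[key] = value
        self.primary.version += 1
        for standby in self.standbys:
            # Status update: copy the primary's state to each standby.
            standby.state = dict(self.primary.state)
            standby.version = self.primary.version
        return self.primary.version

    def fail_over(self, last_known_version):
        candidate = self.standbys[0]
        if candidate.version < last_known_version:
            # The backup state is not fresh enough to resume service.
            raise RuntimeError("standby state is stale")
        self.primary = self.standbys.pop(0)
        return self.primary
```

In this sketch the standbys are updated on every event; a design that updates them less often would need to bring the standby up to date inside `fail_over` before promoting it.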

Spare

  • Standby spare platform.
  • Must be rebooted to the appropriate software configuration, and the state must be initialized to the point where the failure occurred.
  • Therefore checkpoints of the system state must be made regularly.

Repair Tactics / Component Reintroduction

  • When a redundant component fails, it may be reintroduced after it has been repaired.

Shadow operation

  • The previously failed component may be made to run in shadow mode to mimic behaviour of working components for a short time before making it operational.
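
One way to picture shadow operation is the sketch below, in which the repaired component receives the same requests as a working one and its answers are only compared, never used. The function name `shadow_agrees` and the callable-based components are assumptions for illustration.

```python
def shadow_agrees(live, shadow, requests):
    """Return True if the shadow component mimics the live one on every request.

    `live` and `shadow` are hypothetical callables; only the live result
    would be returned to clients while the shadow is under observation.
    """
    for request in requests:
        expected = live(request)
        try:
            if shadow(request) != expected:
                return False
        except Exception:
            return False  # a crash in shadow mode also counts as a mismatch
    return True
```

The repaired component would be made operational only after `shadow_agrees` has held over the observation period.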

State resynchronization

  • A restored component must have its state brought up to date before it returns to service.
  • The ideal approach is to update the state with a single atomic message. Incremental state updates lead to complicated software.

Checkpoint/Rollback

  • A checkpoint is a recording of a consistent state, made either periodically or in response to specific events.
  • System can be restored using a previous consistent checkpoint and a log of transactions since the last checkpoint was taken.
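
The restore procedure described above (last checkpoint plus a replay of the transaction log) can be sketched as follows. `CheckpointedStore` and its method names are assumptions for the example, and everything is kept in memory for brevity, whereas a real system would persist both the checkpoint and the log.

```python
import copy


class CheckpointedStore:
    """Key/value state with a checkpoint and a log of later changes."""

    def __init__(self):
        self.state = {}
        self._checkpoint = {}  # last consistent snapshot
        self._log = []         # transactions applied since that snapshot

    def apply(self, key, value):
        self._log.append((key, value))  # log first, then change the state
        self.state[key] = value

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self):
        # Start from the snapshot and replay everything logged since then.
        self.state = copy.deepcopy(self._checkpoint)
        for key, value in self._log:
            self.state[key] = value
```

After a failure corrupts `state`, calling `recover()` rebuilds it from the snapshot and the log, so no logged transaction is lost.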

Fault Prevention

Removal from Service

  • Removes a component from operation to undergo activities to prevent anticipated failures.
  • For example, rebooting a component regularly to prevent memory leaks from causing a failure.
  • The architectural strategy must be designed to support it.

Transactions

  • A transaction bundles several actions together so that the entire bundle can be undone at once.
  • If one action fails, the entire transaction fails.
  • Intermediate data does not corrupt the output or affect the rest of the system.
  • Transactions are also used to lock shared data so that simultaneous threads do not collide.
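
A minimal sketch of the all-or-nothing behaviour follows. `TransactionalStore` and `transact` are hypothetical names, and working on a private copy is just one simple way to get rollback; it is not the only one.

```python
import copy
import threading


class TransactionalStore:
    """Applies a bundle of actions atomically to shared data."""

    def __init__(self, data):
        self.data = dict(data)
        self._lock = threading.Lock()  # keeps concurrent threads from colliding

    def transact(self, actions):
        with self._lock:
            working = copy.deepcopy(self.data)  # intermediate data stays private
            for action in actions:
                action(working)  # any exception aborts the whole bundle
            self.data = working  # commit only if every action succeeded
```

If any action raises, the exception propagates out of `transact` and `self.data` is left exactly as it was before the bundle started.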

Process Monitor

  • Detects and shuts down failed processes.
  • A new process instance is created and its state is recovered.
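
A very small process monitor might look like the sketch below. The function name `supervise`, the `max_restarts` limit and the convention that a non-zero exit status means failure are assumptions for the example; state recovery is left to the restarted process itself.

```python
import subprocess
import sys


def supervise(command, max_restarts):
    """Run `command` and start a new instance whenever it fails.

    A non-zero exit status is treated as a failure (an assumption of this
    sketch). Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        process = subprocess.Popen(command)
        exit_code = process.wait()  # detect that the process has ended
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"gave up after {restarts} restarts")
        restarts += 1  # create a new process instance on the next pass
```

A call such as `supervise([sys.executable, "worker.py"], max_restarts=3)` would keep a hypothetical `worker.py` script running through up to three failures.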