Fault Tolerant Distributed
Systems
Lecture+Reading Course, 5points
M. PAPATRIANTAFILOU, PH. TSIGAS
INDEX
-
Course Summary
-
Course Structure
-
Literature (and electronically available student
reports)
-
Other Resources
-
Next Classes (expired)
-
Communication, Announcements
SUMMARY
We will study a range of ways to cope with failures in various types
of distributed systems (shared memory control systems, multiprocessors,
networks) and applications (operating systems kernel, concurrent languages,
network services). We will use some material from books, both on foundations
and on applications (to provide breadth of knowledge and to refresh some
introductory material, but mostly research publications (to go into some
depth in the hot (or cool :-) topics and to see some nice open areas of
research).
PLANNED STRUCTURE/SEQUENCE OF PROBLEMS TO STUDY:
It will be a combination of lecture+reading 5 point course.
A participant will be expected to give one presentation, to prepare
for each meeting (for asking questions, participating in discussion, ...),
and to write a small summary for one of the topics.
The first item will be covered by (2) lectures. There will be additional
lecture parts in the subsequent items, when some background/introductory
material is necessary to be presented. The rest will be presentations by
the participants. People interested can make groups of 2 or 3 people and
share the task of 2 or 3 presentations and the corresponding summaries.
Please send an e-mail (ptrianta@cs.chalmers.se,
tsigas@cs.chalmers.se) to have
a look at the reading material, have copies of it, discuss any questions/suggestions
you have and/or sign for an item.
-
Fault-tolerance in network-services: Resource allocation problems
-
Locking versus non-locking in shared memory multiprocessors:
* Scalable synchronization and different types of locking structures
* Counting (and balancing) networks
* Lock-free, wait-free synchronization (the snapshot problem as a case-study)
-
The consensus/agreement problem (a very important problem, because it is
the key of most synchronization/coordination/decision problems in distributed
systems)
* An overview of basic impossibility results in presence of failures
in systems with simple communication primitives; implications of these
results
* When and how the problem becomes solvable; study "sample" solutions
and their implications for fault-tolerant synchronization, both in multiprocessor
systems and in computer networks
* Implications of the above at the system design level (transactional
memory, ...)
-
The self-stabilizing paradigm for tolerance to transient failures in distributed
systems.
LITERATURE as planned per class (updated with
pointers -wherever possible):
The reading material is organized as a planned lecture(class)-wise break-up.
Wherever it is mentioned that the material is for 2 classes, the people
who are interested can make groups of 2 or 3 persons and share the task
of 2 or 3 presentations and the corresponding summaries. This will be good
and will help the preparation. If, however, you would prefer to work alone,
the material can be split accordingly.
Please send an e-mail -ptrianta@cs.chalmers.se,
tsigas@cs.chalmers.se) to have
a look at the reading material, have copies of it, discuss any questions/suggestions
you have and/or sign for an item.
Reference item, related -but not necessary- with some of the items
-
N. Lynch ``Distributed Algorithms'' (Morgan-Kaufmann, 1995)
1st Class
-
Copies of slides used for the introduction.
-
M. Raynal ``The mutual exclusion problem'', Chapter 5, MIT Press, 1987.
-
F. Mattern ``Logical
Time'' To appear in: P. Dasgupta, J. Urban (Eds.): Encyclopedia of
Distributed Computing, Kluwer, 1998.
2nd Class
-
K.M. Chandy and J. Misra ``The Drinking Philoshophers Problem'' ACM Transactions
on Programming Languages and Systems, Vol. 6, No. 4, October 1984, pp.
632-646.
-
Manhoi Choy and Ambuj Singh, ``Efficient Fault Tolerant Algorithms for
Resourse Allocation in Distributed Systems'' ACM TOPLAS, Vol. 17, No. 3,
May 1995, pp. 535-559.
-
M. Papatriantafilou, P. Tsigas ``On Distributed Resource Handling: Dining,
Drinking and Mobile Philoshophers'' Int'l Conf. on Principles of Distributed
Systems -- OPODIS '97, invited/keynote paper, pp. 293-308.
3rd, 4th Classes
-
Nir Shavit, ``Multiprocessor Synchronization'', Handouts 9-11, MIT Lecture
Notes, 1996.
-
John M. Mellor-Crummey and Michael L. Scott. Algorithms for scalable synchronization
on shared-memory multiprocessors. ACM Transactions on Computer Systems,
9(1):21-65, February 1991.
-
J. Aspnes, M. Herlihy, N. Shavit ``Counting
Networks'' Journal of the ACM 1(5): pp 1020--1048, Sept. 1994 (prel.
vers. in proceedings of ACM STOC 1991), pp. 348-358.
Additional, auxiliary papers
-
M. Herlihy , Lim, N. Shavit ``Scalable
Concurrent Counting'' ACM TOCS 13(4), 1995.
-
N. Shavit et. al. ``Diffracting
Trees'' ACM TOCS.
5th Class (work can be done in collaboration with the group working
with classes 6,7)
-
Nir Shavit, ``Multiprocessor Synchronization'', Handouts 4-5, MIT Lecture
Notes, 1996.
-
Y. Afek, H. Attiya, D. Dolev, E. Gafni, M. Merritt, N. Shavit ``Atomic
Snapshots of Shared Memory'' Journal of the ACM, Vol. 40, 4, Sept. 1993,
pp.873--890.
-
L. Kirousis, P. Spirakis and Ph. Tsigas ``Reading Many Variables in One
Atomic Operation: Solutions with Linear or Sublinear Complexity'' IEEE
Transactions on Parallel and Distributed Systems, 5(7), pp. 688-696, 1994.
6th, 7th Classes (report
in .doc.gz format)
-
Nir Shavit, ``Multiprocessor Synchronization'', Handouts 7-8, MIT Lecture
Notes, 1996.
-
M.P. Herlihy ``Wait-Free Sychronization'' ACM Transactions on Programming
Languages and Systems, 11 (1), pp. 124-149, 1991.
-
(time permitting)
Michael Fischer, Nancy Lynch and Michael Paterson ``Impossibility of
Distributed Consensus with One Faulty Process'' Journal of the ACM, Vol.
32, No. 2, April 1985, pp. 374-382.
-
D. Dolev, C. Dwork and L. Stockmeyer ``On the Minimal Synchronism Needed
for Distributed Consensus'' Journal of the ACM, 14(1), pp. 77-79, 1987.
8th, 9th Classes (report in
ps.gz format)
-
N. Lynch ``Distributed Algorithms'', Part of Chapter 21, (Morgan-Kaufmann,
1995)
-
M. Papatriantafilou and Ph. Tsigas ``Wait-Free Consensus in ``In-Phase''
Multiprocessor Systems'' Proceedings of the IEEE SPDP 1995, pp. 312-319.
-
M. Herlihy, J.E.B. Moss ``Transactional
Memory: Architectural Support for Lock-Free Data Structures'' In Proceedings
of the 1993 International Symposium in Computer Architecture, May 1993,
San Diego, CA.
-
(time permitting)
N. Shavit and D. Touitou. ``Software
Transactional Memory'' Proceedings of ACM PODC 1995, pages 204--213.
10th, 11th Classes (report
in ps.gz format)
-
Boaz Patt-Shamir, George Varghese ``Notes on Self-stabilization'' from
MIT Lecture Notes on Distributed Algorithms (course by Nancy Lynch), December
1992.
-
E.W. Dijkstra ``Self-stabilizing systems in spite of distributed control''
Comm. ACM 17, 11 (1974), 643--644.
-
S. Katz, K.J. Perry ``Self-Stabilizing extensions for message-passing systems''
Distributed Computing, Vol. 7, 1993, pp. 17--26.
-
(time permitting)
Baruch Awerbuch, Boaz Patt-Shamir, George Varghese ``Self-Stabilization
By Local Checking and Correction'' Proceedings of IEEE FOCS 1991, pp. 268-277.
-
(time permitting)
Baruch Awerbuch, George Varghese ``Distributed Program Checking: a
Paradigm for Building Self-Stabilizing Distributed Protocols'' Proceedings
of IEEE FOCS 1991, pp. 258-267.
OTHER RESOURCES:
-
Distributed Algorithms
& Systems Home : a page with pointers relevant to research
in distributed algorithms and systems: conference announcements and calls-for-papers,
pointers to institutes, projects, literature, bibliographies, special pages
describing a particular research area, and many more.
NEXT CLASS:
The course is held every Monday, 13:30-15:00
Seminar room S4
at the Computing
Science building
COMMUNICATION, ANNOUNCEMENTS
-
(timestamp: 980323)
This page serves as a shared-memory communication medium (i.e.
we write and you all can read, but *you* have to *go* to read it to see
added or modified items).
However,
it is also good to sometimes be informed using our message-passing
communication medium (i.e. have *the information* *come and interrupt*
us all when there is something new, i.e. use our e-mail systems).
Therefore, if you would like, please send an e-mail to ptrianta@cs.chalmers.se
and/or to tsigas@cs.chalmers.se
so that we can make a mailing list and
it is possible to broadcast something or have a discussion by e-mail
with the people who are ineterested when it is needed.
-
(timestamp: 980331)
Please send an e-mail -ptrianta@cs.chalmers.se,
tsigas@cs.chalmers.se) to arrange
for a meeting if you would like to have a look at the reading material,
have copies of it, discuss any questions/suggestions you have and/or sign
for an item.
ptrianta@cs.chalmers.se,
tsigas@cs.chalmers.se
-