Advanced Topics in Computer Systems |
Lectures 1 & 2
|
Joe Hellerstein & Anthony Joseph |
|
System R & DBMS Overview
DBMS History
- late 60's: network (CODASYL) & hierarchical (IMS) DBMS.
Charles Bachman: Turing award #8 (1973, between Dijkstra and Knuth!)
- Low-level ``record-at-a-time'' DML, i.e. physical data structures
reflected in DML (no data independence)
- 1970: Codd's paper. The most influential paper in DB research. Set-at-a-time
DML. Data independence. Allows for schema and physical storage structures
to change under the covers. Truly important theory, led to "paradigm shift"
in thinking and in practice. (Papadimitriou: "as clear a paradigm
shift as we can hope to find in computer science"). Edgar F. Codd: Turing
award #18 (1981, between Hoare and Cook).
- Focus on separation into 3 levels: physical storage, logical schema,
and multiple views.
- Results in two kinds of independence: physical data independence,
and logical data independence
- Physical allows you change the storage layout without affecting
apps. DBMS makes right.
- Logical encapsulates apps from changes in logical schema.
DBA makes right (except in simple cases, when DBMS makes right).
- CRITICAL for database evolution -- and note that databases live
and evolve for a LONG time!
- early-to-mid-70's: raging debate between the two camps. "great debate"
in 1975
- mid 70's: 2 full-function (sort of) prototypes. Ancestors of essentially
all today's commercial systems
- Ingres
: UCB 1974-77
- a ``pickup team'', including Stonebraker & Wong. early and pioneering.
begat Ingres Corp (CA), CA-Universe, Britton-Lee, Sybase, MS SQL Server, Wang's
PACE, Tandem Non-Stop SQL.
- System R
: IBM San Jose (now Almaden)
- 15 PhDs. begat IBM's SQL/DS & DB2, Oracle, HP's Allbase, Tandem
Non-Stop SQL. System R arguably got more stuff ``right'', though there was
lots of information passing between both groups
- Jim Gray: Turing Award #22 (1998)
- Lots of Berkeley folks on the System R team, including Gray (1st
CS PhD @ Berkeley), Bruce Lindsay, Irv Traiger, Paul McJones, Mike Blasgen,
Mario Schkolnick, Bob Selinger , Bob Yost. See
http://www.mcjones.org/System_R/SQL_Reunion_95/sqlr95-Prehisto.html#Index71.
- Both were viable starting points, proved practicality of relational
approach. Beautiful example of theory -> practice!!
- ACM Software Systems award #6 shared by both
- Interestingly, Stonebraker received ACM SIGMOD Innovations Award
#1 (1991), Gray #2 (1992).
- early 80's: commercialization of relational systems
- Ellison's Oracle beats IBM to market by reading white papers.
- IBM releases multiple RDBMSs, settles down to DB2. Gray (System
R), Jerry Held (Ingres) and others join Tandem (Non-Stop SQL), Kapali Eswaran
starts EsVal, which begets HP Allbase and Cullinet
- Relational Technology Inc (Ingres Corp), Britton-Lee/Sybase, Wang
PACE grow out of Ingres group
- CA releases CA-Universe, a commercialization of Ingres
- Informix started by Cal alum Roger Sippl (no pedigree to research).
- Teradata started by some Cal Tech alums, based on proprietary networking
technology (no pedigree to software research, though see parallel DBMS discussion
next semester!)
- mid 80's: SQL becomes "intergalactic standard''.
- DB2 becomes IBM's flagship product.
- IMS "sunseted''
- today: network & hierarchical are legacy systems (though commonly
in use!)
- relational commoditized -- Microsoft, Oracle and IBM fighting over
much of the market. IBM just bought Informix, to become solid #2. Sybase/NCR
Teradata and a couple others vying to survive on the fringes
- Computer Associates bought Ingres and largely killed it
- Oracle bought DEC Rdb and largely killed it
- Informix bought Illustra (commercial Postgres), incorporated it.
- IBM bought Informix, will probably incorporate but mostly sunset
it.
- Microsoft bootstrapped by buying code to Sybase SQL Server (hence
Ingres pedigree)
- "object-relational" is mainstream
- variant of relational we'll talk about more next spring
- think relational + extensible types and downloadable code + rule
systems + non-normalized data
- Informix, IBM, Oracle, Sybase all claim to sell it today.
Microsoft in "the next version".
Relational System Architecture
Databases are BIG pieces of software. Typically very hard to modularize.
Lots of system design decisions at the macro and micro scale. We will focus
mostly on micro decisions -- and hence ideas reusable outside DBMSs -- in
subsequent lectures. Here we focus on macro design.
Disk management choices:
- file per relation
- big file in file system
- raw device
Process Model:
- process per user
- server
- multi-server
Basic modules:
- parser
- query rewrite
- optimizer
- query executor
- access methods
- buffer manager
- lock manager
- log/recovery manager
Query Rewriter
- Flattens views (why?)
- may change query semantics (constraints, protection, etc.)
Optimizer
- large space of equivalent relational plans
- pick one that's going to be "optimal" (?)
- produces either an interpretable plan tree, or compiled code
Executor
- modules to perform relation operations like joins, sorts, aggregations,
etc.
- calls Access Methods for operations on base and temporary relations
Access Methods
- uniform relational interface (open, get next), a la INGRES AMI,
System R's RSS
- multiple implementations: heap, B-tree, extensible hashing
Buffer Manager
- Intelligent user-level disk cache
- must interact with transaction manager & lock manager
- Virtual memory does not cut it! (we'll discuss this at length)
Lock Manager
- must efficiently support lock table
- System R architecture influential:
- physical and logical locks treated uniformly
- multiple granularity of locks
- set intent locks at high levels
- we will study this in more detail later (Gray)
- deadlock handling: detection
Log/Recovery Manager
- "before/after" log on values
- checkpoint/restore facility for quick recovery
- Redo/Undo on restore
- Support soft crashes off disk, hard crashes off tape.
- System R?s shadowing is too slow. Use Write-Ahead Logging! (WAL)
More on this to come.
- Hard to get right!
Notes on INGRES
What of today's systems comes from Ingres?
- query rewrite, especially for views and integrity constraints
- use of UNIX -- bleeding edge system!
- catalogs as relations
- idea of extensible access methods (we'll come back to this later
in the semester)
- thoughts on strange-length transactions
- interpreted query plans (?)
- simple, clear language is a good thing
Things it got wrong:
- static trees & hashing
- interpreted query plans (?)
- file/relation
- hill-climbing, adaptive optimizer (though this is coming back!
wait for "eddies")
- process(es)/user
- locking: column-level (actually table). Deadlock avoidance.
- logging/recovery
- argued that disk clustering wasn't important
Notes on System R
See the System R
reunion notes
for fun background and gossip.
More influential? A lot of the ideas remain in today?s systems:
- optimizer remains, largely unchanged
- RSS/RDS divide remains in many systems
- SQL, cursors, duplicates, NULLs, etc.
- the pros and cons of duplicates. Alternatives?
- pros and cons of NULLs. Alternatives?
- grouping and aggregation
- updatable single-table views
- begin/end xact at user level
- savepoints and restore
- catalogs as relations
- flexible security (GRANT/REVOKE)
- integrity constraints
- triggers (!!)
- clustering
- compiled queries
- B-trees
- Nest-loop & sort-merge join, all joins 2-way
- dual logs to support log failure
Stuff they got wrong:
- shadow paging
- predicate locking
- SQL language
- duplicate semantics
- subqueries vs. joins
- outer join
- rejected hashing
Database View of Applications
Big, complex record-keeping applications like SAP and PeopleSoft, which run
over a DBMS. "Enterprise applications" to keep businesses humming.
A smattering:
- ERP: Enterprise Resource Planning (SAP, Baan, PeopleSoft, Oracle,
IBM, etc.)
- CRM: Customer Relationship Management (E.phiphany, Siebel, Vantive,
Oracle, IBM, etc.)
- SCM: Supply Chain Management (Trilogy, i2, Oracle, IBM, etc.)
- Human Resources, Direct Marketing, Call Center, Sales Force Automation,
Help Desk, Catalog Management, etc.
- Many e-business versions of order-entry/procurement and the above
(i.e. web serving packages for this)
- Integrated B2B versions of all this rolling as fast as possible
A main job of DBMS is to make these kinds of apps easy to write
OS and DBMS: Philosophical Similarities & Differences
- UNIX paper: "The most important job of UNIX is to provide a file system".
- UNIX and System R are both "information management" systems!
- both also provide programming APIs for code
- Difference in focus: Bottom-Up (elegance of system) vs. Top-Down (elegance
of semantics)
- main goal of UNIX was to provide a small elegant set of mechanisms,
and have programmers (i.e. C programmers) build on top of it. As an
example, they are proud that "No large 'access method' routines are required
to insulate the programmer from system calls". After all, OS viewed
its role as presenting hardware to computer programmers.
- main goal of System R and Ingres was to provide a complete system
that insulated programmers (i.e. SQL + scripting) from the system, while
guaranteeing clearly defined semantics of data and queries.
After all, DBMS views its role as managing data for application programmers.
- Affects where the complexity goes!
- to the system, or the end-programmer?
- question: which is better? in what environments?
- follow-on question: are internet systems more like enterprise
apps (traditionally built on DBMSs) or scientific/end-user apps (traditionally
built over OSes and files)? Why?
- Achilles' heel of RDBMSs: a closed box
- Cannot leverage technology without going through the full SQL stack
- One solution: make the system extensible, convince the world to
download code into the DBMS
- Another solution: componentize the system (hard, due to transaction
semantics)
- Achilles' heel of OSes: hard to decide on the "right" level of abstraction
- As we'll read, many UNIX abstractions (e.g. virtual memory) hide
*too* much detail, messing up semantics. On the other hand, too low
a level can cause too much programmer burden, and messes up the elegance
of the system
- One solution: make the system extensible, convince the fancy apps
to download code into the OS
- Another solution: componentize the system (hard, due to protection
issues)
- Traditionally separate communities, despite subsequently clear need
to integrate
- UNIX paper: "We take the view that locks are neither necessary nor
sufficient, in our environment, to prevent interference between users of
the same file. They are unnecessary because we are not faced with large,
singleĀfile data bases maintained by independent processes."
- System R: "has illustrated the feasibility of compiling a very high-level
data sublanguage, SQL, into machine-level code".
So, a main goal of this class is to work from both of these directions, cull
the lessons from each, and ask how to use these lessons today both within
and OUTSIDE the context of these historically separate systems.