Difference between revisions of "Databases"
From Suhrid.net Wiki
Jump to navigationJump to searchLine 184: | Line 184: | ||
* FD's are a generalization of the notion of keys. | * FD's are a generalization of the notion of keys. | ||
* FD's apart from being used in relational design are also useful in data storage : generating compression schemes and for query optimization. | * FD's apart from being used in relational design are also useful in data storage : generating compression schemes and for query optimization. | ||
+ | * example of FD. Two rows with same value of ssn have same name. For all t, u E Relation if t.ssn = u.ssn => t.name = u.name. (ssn -> name) | ||
+ | * So in general, A -> B. It can be extended to a set of attributes A1, A2, ... AN -> B1, B2 ... BM |
Revision as of 07:09, 1 February 2014
Contents
Relational Databases
- Set of named relations (tables)
- Each relation has a set of named attributes (columns)
- Each tuple (row) has a value for each attribute
- Each attribute has a type (or domain)
- All queries return relations (tables), so it can said to be compositional or closed.
XML Data
- Alternative to relational model to store data
- Standard for data representation and exchange
- Document format similar to HTML, however tags represent content instead of formatting.
- XML consists of tagged elements which can be nested, attributes of elements and the text (values/data)
- XML document can be thought of as a tree with values/text forming the leaves.
Relational vs XML
Relational | XML | |
---|---|---|
Structure | Tables | Hierarchical/Tree |
Schema | Fixed in advance | Flexible, self-describing. e.g. all elements need not have same attributes. (In relational, we would need to have NULL values). Inconsistencies in the structure is NOT a problem. |
Queries | Relatively simple | Complicated compared to Relational / SQL. |
Ordering | Fundamentally UNORDERED. e.g. Order by clause is required in SQL in case order is desired. | Ordering is implied - e.g. the way the text is laid out in the XML document or in the Stream. |
Implementation | Mature - native implementation | Add - on. (Underlying implementation is relational) |
Well formed XML
- Single root element
- XML parser is used to validate XML.
- DOM parser - treats XML as a document, SAX parser - treats XML as a stream.
- Matched tags, proper nesting
- Unique attributes within each element
Displaying XML
- Since XML occurs a lot on the internet, a convenient way is required to display XML.
- Use rule based languages such as CSS, XSL to translate XML to HTML.
- XML document is fed to a CSS/XSL interpreter alongwith rules and a HTML document is generated.
Valid XML
- Valid XML should adhere to requirements for well formed XML
- Also should adhere to content-specific specifications : DTD/XSD
- DTD is a grammar-like language for specifying elements, attributes, nesting, ordering, #occurrences
- DTD/XSD offers the benefits of "typing" - can serve as a specification. Useful for data exchange.
- Having no DTD offers flexibility of data.
- DTD also can specify pointer types - an element can have an ID and another element can refer to it using an IDREF.
XSD
- Like DTD, but more comprehensive.
- XSD is written in XML itself.
- Everything in DTD is string, in XSD e.g., we can specify that an attribute should be of a specific type.
- Pointers can be typed, so we can enforce that a reference is made to a particular type and not just global like in DTD's
- Also specify min/max constraints on elements
JSON
- Like XML, JSON can be thought of as a data model - as an alternative to the relational data model.
- Useful for semi-structured data
- Designed originally for serializing data objects
- JSON is human readable - often used for data interchange
- JSON consists of base values, objects (label-value pairs), and arrays to hold either. Structure is recursive.
- JSON vs Relational - JSON differs in how data is stored, schema is self-describing, ordering is present, query support is minimal unlike XQuery, XPath, XSLT, and is implemented typically in prog languages
- JSON vs XML - XML more verbose/complex than JSON, like DTD, XSD, JSON has JSON schema.
- XML model dont match with programming languages. Interface is clunky. This is called impedence mismatch - similar to Relational and prog languages.
- JSON has a more direct mapping with prog languages.
Valid JSON
- Adheres to basic structural requirements of JSON. (label-value pairs, arrays of values)
- Can use JSON schema for semantic validation.
Relational Algebra
- Relational algebra is the formal language that is the underpinning for implemented languages like SQL
- Query (expression) on a set of results produces relation as a result.
- Simplest query in RA is the name of the relation e.g. Student which produce the contents of the Student relation
- Operators to filter, slice and combine relations.
- Select operator : Sigma (cond) Relation name. Condition is a filter and the result will be a subset of the relation
- Project operator : Select picks rows, project picks columns. Pi (col name) Relation Name.
- To pick both rows and cols, combine the select and project operator. e.g. Project(sid,name) (Select(gpa > 3.7) student)
- Cross Product : By default RA is based on sets and doesnt consider duplicates. But SQL uses multisets/bags and includes duplicates - Cross product between two relations combines all attributes (columns) and every combination of tuples between the relation. m X n attrs and a X B tuples.
- Natural Join : Is like cross product but enforces equality on all attributes with the same name. Eliminate one copy of duplicate attributes (same attribute from diff table). So if we use the nat. join operator we dont need to specify the key attributes need to be equal condition. Doesnt provide any additional expressive power, but is convenient notationally.
- Theta Join : A convenience like natural join. Basically apply the theta condition to the cross product of two relations. This is the "join" that is implemented in most DBMS.
- Union operator : Used to produce a list.
- Difference operator : Here but not there.
- Intersection : Both here and there.
- Rename operator : Necessary to express certain queries. Takes a result of a schema and calls it something. This name can be used later on. Useful for unifying schemas for set operators (Relational Algebra) rule and for disambiguations in self-joins (joins on the same relation).
SQL
- Two parts - DDL (create,drop) DML (query and modify)
- SQL is declarative - what, not how. Therefore, query optimizer is very important.
- Selecting from multiple tables is a join, but unlike relational algebra, we need to specify the join condition manually.
- e.g. select a from person, city where person.lives = city.name
- SQL essentially follows an unordered model.
- Order on multiple cols can be specified. e.g order by age, population - age will be sorted and population will be sorted for a given age.
- Like RA, SQL has set operators.
- Union : Union eliminates duplicates, Union all includes duplicates
- Intersect : Some DB's dont support it, but can be simulated using join
- Except : (Set Difference). Again some db's dont support it.
- Subqueries : Useful for counting duplicates instead of join. Join can eliminate some required duplicates etc - The In clause comes in play. To simulate except - in A and not in B
- In uses to check whether values exist in the subquery or not. Exists can be used to check whether the result of the subquery is empty or not.
- Subqueries can refer to variables in the main query.
- Exists can be used to find largest/smallest. e.g. select name from student s1 where not exists (select * from student s2 where s2.gpa > s1.gpa) - all students such that there does not exist another student who's GPA is higher
- Any & all operator can be used to qualify a subclass. Any/all can be equally expressed using the exists clause.
- Subqueries can be used in the FORM and SELECT clauses apart from the WHERE clause. A sub query in the from clause allows you to write a select from where expression
and then use the result of that expression as if it were an actual table in the database. Subqueries in SELECT clause must just return one value.
- Null values in DB can produce unexpected results - distinct does include null, but count of distinct does NOT. Watch out.
Aggregation
- Perform computations over sets of values in multiple rows. Basic aggregation functions : min, max, avg, count. Part of select clause.
- With aggregate functions, we get two new clauses - group by and having. Group by allows to partition our relation into groups and will compute aggregate functions over group independently.
- Having allows to filter on the results of aggregate values. "where" applies to a single row, "having" applies to the groups that are generated in the group by clause
- count(*) also allows count(some col) - we can also do count(distinct col) which can be quite useful
- "Group by" is used only in conjunction with aggregation. - Grouping takes a table and partitions it by values of a given attribute or a set of attributes. e.g. group by sName will partition student according to names. select count(*) from Student group by sname - counts of groups of students eaching having the same name.
- A query with group by/having can be rewritten as a complicated query without those clauses.
Modification
- Insert into table values OR
- Insert into table (Select statement) : result of the select will be inserted into the table
- Delete from table (condition) : condition can be a subquery : but some db systems dont fully support this where the subquery references the same table
- Update table set attr = expre where condition, (need not have a where clause also)
Join
- In from clause when we list multiple tables, it is implicitly a cross product of the tables.
- Inner join on condition : Equivalent to theta join in RA. Cross product with a condition. Default join operator. So "inner" keyword is not necessary. e.g. select x from a join b on a.someid = b.otherid
- Natural join : Elimnates duplicates in columns with the same name. Therefore, no need to specify "on" clause. select sName from student natural join apply
- Inner join with Using clause (using specific attributes) : This is better practice than using natural join to actually specify the attribute to join on.
- Outer join : left / right / full. To add tuples that dont match the join condition. outer can be left out. So only left join is ok.
- Left : any tuples on the left with no matching tuple on right will be added. Outer join can also be written as a natural join if the using clause is not specified.
- Right : vice-versa
- Full : unmatched tuples from both left & right. Basically it is a left outer join UNION right outer join
- Note: the outer join is NOT associative. A ojoin (B ojoin C) <> (A ojoin B) ojoin C
- Join ordering offers hint to SQL query executor on how to execute joins. So tweaking join ordering can be used to influence query performance.
Relational Design Theory
- Design anomalies
- Redundancy : Duplicate information
- Update anomaly : Direct effect of redundancy. Update facts in some places but NOT ALL. Or update differently in different places - we end up with an inconsistent database.
- Deletion anomaly : Allowing a complete deletion about something. e.g. if surfing is a hobby and stored alongwith student, if we delete surfing, then we delete those students whose only hobby is surfing.
- The best design for a relational DB, depends not only about constructing the relations well, but also on what the data is representing in the real world.
- Design by decomposition
- Start with mega relations that contain everything
- Decompose into smaller relations with same info
- Automatic decomposition can be done - "Mega" relations + properties
- System uses properties of the data to perform decomposition.
- Final set of relations satisfies a normal form - i.e. no anomalies present and NO lost data.
- Functional dependencies - System generates Boyce Codd Normal Form
- FD + Multivalued dependencies - Fourth Normal Form.
- A functional dependency A -> B, means A functionally determines B. e.g. SSN -> Name. An SSN always maps to the same name, but not necessarily the other way around. This relationship should be stored once.
It is a 1-1 mapping. There can be only 1 B value given an A.
- What BCNF says is that when we have a functional dependency, the left side of the FD (i.e. A) must be a KEY.
- Multi-valued dependencies : A ->> B. Take the relation Apply(ssn, cname, hs) : This is in BCNF, becaue there are no FD's. But there is redundancy, for every cname, hs, we'll have c*h tuples.
There is a MVD from SSN to CName, and HS. SSN->>CName, SSN->>HS. i.e. have every combination of those two attributes and values in those attributes for a given social security number. cname and HS are essentially independent. So store them once. So 4NF says if A->>B, then A should be a key. So Apply(ssn, cname, hs) can be decomposed into Apply(ssn, cname) and HighSchool(ssn, hs). So we get c+h tuples rather than c*h tuples.
Functional Dependencies
- FD's are a generalization of the notion of keys.
- FD's apart from being used in relational design are also useful in data storage : generating compression schemes and for query optimization.
- example of FD. Two rows with same value of ssn have same name. For all t, u E Relation if t.ssn = u.ssn => t.name = u.name. (ssn -> name)
- So in general, A -> B. It can be extended to a set of attributes A1, A2, ... AN -> B1, B2 ... BM