Databases
From Suhrid.net Wiki
Jump to navigationJump to searchContents
Relational Databases
- Set of named relations (tables)
- Each relation has a set of named attributes (columns)
- Each tuple (row) has a value for each attribute
- Each attribute has a type (or domain)
- All queries return relations (tables), so it can said to be compositional or closed.
XML Data
- Alternative to relational model to store data
- Standard for data representation and exchange
- Document format similar to HTML, however tags represent content instead of formatting.
- XML consists of tagged elements which can be nested, attributes of elements and the text (values/data)
- XML document can be thought of as a tree with values/text forming the leaves.
Relational vs XML
Relational | XML | |
---|---|---|
Structure | Tables | Hierarchical/Tree |
Schema | Fixed in advance | Flexible, self-describing. e.g. all elements need not have same attributes. (In relational, we would need to have NULL values). Inconsistencies in the structure is NOT a problem. |
Queries | Relatively simple | Complicated compared to Relational / SQL. |
Ordering | Fundamentally UNORDERED. e.g. Order by clause is required in SQL in case order is desired. | Ordering is implied - e.g. the way the text is laid out in the XML document or in the Stream. |
Implementation | Mature - native implementation | Add - on. (Underlying implementation is relational) |
Well formed XML
- Single root element
- XML parser is used to validate XML.
- DOM parser - treats XML as a document, SAX parser - treats XML as a stream.
- Matched tags, proper nesting
- Unique attributes within each element
Displaying XML
- Since XML occurs a lot on the internet, a convenient way is required to display XML.
- Use rule based languages such as CSS, XSL to translate XML to HTML.
- XML document is fed to a CSS/XSL interpreter alongwith rules and a HTML document is generated.
Valid XML
- Valid XML should adhere to requirements for well formed XML
- Also should adhere to content-specific specifications : DTD/XSD
- DTD is a grammar-like language for specifying elements, attributes, nesting, ordering, #occurrences
- DTD/XSD offers the benefits of "typing" - can serve as a specification. Useful for data exchange.
- Having no DTD offers flexibility of data.
- DTD also can specify pointer types - an element can have an ID and another element can refer to it using an IDREF.
XSD
- Like DTD, but more comprehensive.
- XSD is written in XML itself.
- Everything in DTD is string, in XSD e.g., we can specify that an attribute should be of a specific type.
- Pointers can be typed, so we can enforce that a reference is made to a particular type and not just global like in DTD's
- Also specify min/max constraints on elements
JSON
- Like XML, JSON can be thought of as a data model - as an alternative to the relational data model.
- Useful for semi-structured data
- Designed originally for serializing data objects
- JSON is human readable - often used for data interchange
- JSON consists of base values, objects (label-value pairs), and arrays to hold either. Structure is recursive.
- JSON vs Relational - JSON differs in how data is stored, schema is self-describing, ordering is present, query support is minimal unlike XQuery, XPath, XSLT, and is implemented typically in prog languages
- JSON vs XML - XML more verbose/complex than JSON, like DTD, XSD, JSON has JSON schema.
- XML model dont match with programming languages. Interface is clunky. This is called impedence mismatch - similar to Relational and prog languages.
- JSON has a more direct mapping with prog languages.
Valid JSON
- Adheres to basic structural requirements of JSON. (label-value pairs, arrays of values)
- Can use JSON schema for semantic validation.
Relational Algebra
- Relational algebra is the formal language that is the underpinning for implemented languages like SQL
- Query (expression) on a set of results produces relation as a result.
- Simplest query in RA is the name of the relation e.g. Student which produce the contents of the Student relation
- Operators to filter, slice and combine relations.
- Select operator : Sigma (cond) Relation name. Condition is a filter and the result will be a subset of the relation
- Project operator : Select picks rows, project picks columns. Pi (col name) Relation Name.
- To pick both rows and cols, combine the select and project operator. e.g. Project(sid,name) (Select(gpa > 3.7) student)
- Cross Product : By default RA is based on sets and doesnt consider duplicates. But SQL uses multisets/bags and includes duplicates - Cross product between two relations combines all attributes (columns) and every combination of tuples between the relation. m X n attrs and a X B tuples.
- Natural Join : Is like cross product but enforces equality on all attributes with the same name. Eliminate one copy of duplicate attributes (same attribute from diff table). So if we use the nat. join operator we dont need to specify the key attributes need to be equal condition. Doesnt provide any additional expressive power, but is convenient notationally.
- Theta Join : A convenience like natural join. Basically apply the theta condition to the cross product of two relations. This is the "join" that is implemented in most DBMS.
- Union operator : Used to produce a list.
- Difference operator : Here but not there.
- Intersection : Both here and there.
- Rename operator : Necessary to express certain queries. Takes a result of a schema and calls it something. This name can be used later on. Useful for unifying schemas for set operators (Relational Algebra) rule and for disambiguations in self-joins (joins on the same relation).
SQL
- Two parts - DDL (create,drop) DML (query and modify)
- SQL is declarative - what, not how. Therefore, query optimizer is very important.
- Selecting from multiple tables is a join, but unlike relational algebra, we need to specify the join condition manually.
- e.g. select a from person, city where person.lives = city.name
- SQL essentially follows an unordered model.
- Order on multiple cols can be specified. e.g order by age, population - age will be sorted and population will be sorted for a given age.
- Like RA, SQL has set operators.
- Union : Union eliminates duplicates, Union all includes duplicates
- Intersect : Some DB's dont support it, but can be simulated using join
- Except : (Set Difference). Again some db's dont support it.
- Subqueries : Useful for counting duplicates instead of join. Join can eliminate some required duplicates etc - The In clause comes in play. To simulate except - in A and not in B
- In uses to check whether values exist in the subquery or not. Exists can be used to check whether the result of the subquery is empty or not.
- Subqueries can refer to variables in the main query.
- Exists can be used to find largest/smallest. e.g. select name from student s1 where not exists (select * from student s2 where s2.gpa > s1.gpa) - all students such that there does not exist another student who's GPA is higher
- Any & all operator can be used to qualify a subclass. Any/all can be equally expressed using the exists clause.
- Subqueries can be used in the FORM and SELECT clauses apart from the WHERE clause. Subqueries in SELECT clause must just return one value.
Aggregation
- Perform computations over sets of values in multiple rows. Basic aggregation functions : min, max, avg, count. Part of select clause.
- With aggregate functions, we get two new clauses - group by and having. Group by allows to partition our relation into groups and will compute aggregate functions over group independently.
- Having allows to filter on the results of aggregate values. "where" applies to a single row, "having" applies to the groups that are generated in the group by clause
- count(*) also allows count(some col) - we can also do count(distinct col) which can be quite useful