Before I started my current job, I had played with SQL a bit, but had never done anything remotely advanced. Now, I'm sort of the defacto DBA, and so I've had to learn a great deal. One of the things I'm using more and more are UNIONs.
NOTE: The examples here are written using PostgreSQL. Your mileage may vary with other databases.
NOTE: These idioms will be nothing new to the SQL expert.
Time UNIONs
One common use of UNIONs is to combine tables that have data from separate times. For example, suppose that you summarize your apache access logs by day, and so you have something like
CREATE TABLE apache_summary_20071201 (
ip cidr,
url text,
user_agent text
);
CREATE TABLE apache_summary_20071202 (
ip cidr,
url text,
user_agent text
);
NOTE: I actually prefer to use ip4r, but the cidr type is built in.
Then you can create a VIEW to UNION them together.
CREATE OR REPLACE VIEW apache_summary AS
SELECT '2007-12-01'::date AS ymd, apache_summary_20071201.* FROM apache_summary_20071201
UNION ALL
SELECT '2007-12-02'::date AS ymd, apache_summary_20071202.* FROM apache_summary_20071202;
Now you can get all of the data at once. Plus, if you only need one day's worth of data, the query planner is smart enough to only fetch from the table you need if you supply a constant condition. Notice the One-Time Filter: false.
db=> EXPLAIN SELECT * FROM apache_summary WHERE ymd = '2007-12-01';
QUERY PLAN
---------------------------------------------------------------------------------------------
Append (cost=0.00..44.80 rows=1240 width=96)
-> Subquery Scan "*SELECT* 1" (cost=0.00..22.40 rows=620 width=96)
-> Seq Scan on apache_summary_20071201 (cost=0.00..16.20 rows=620 width=96)
-> Subquery Scan "*SELECT* 2" (cost=0.00..22.40 rows=620 width=96)
-> Result (cost=0.00..16.20 rows=620 width=96)
One-Time Filter: false
-> Seq Scan on apache_summary_20071202 (cost=0.00..16.20 rows=620 width=96)
(7 rows)
The advantage to this approach versus partitioning tables is that you don't have to store the date with each entry. Since every entry in the table has the same value, a VIEW works out nicely. The downside is that you can't just add another subquery to the VIEW, so you have to know what is there in order to recreate the view with an additional table. Partitioning makes that much easier.
LIMIT UNIONs
Sometimes when doing reports from a database, you want to get a representative sample for each item you are interested in. For example, imagine that you had a query to return the 10 most popular URLs from your apache_summary VIEW (an exercise left to the reader). Then suppose that you wanted to see the 3 most popular User Agents for each URL. The simple way would be to issue 10 queries, each looking something like
SELECT user_agent, count(*) AS ucount
FROM apache_summary
WHERE ymd = '2007-12-1'
AND url = 'something'
GROUP BY user_agent
ORDER BY ucount DESC
LIMIT 3;
In a background report, that's probably ok, but if this is a web page, the latency of doing 10 queries might really slow things down. Enter our friend, the UNION.
(
SELECT url, user_agent, count(*) AS ucount
FROM apache_summary
WHERE ymd = '2007-12-1'
AND url = 'something'
GROUP BY user_agent
ORDER BY ucount DESC
LIMIT 3
) UNION ALL (
SELECT url, user_agent, count(*) AS ucount
FROM apache_summary
WHERE ymd = '2007-12-1'
AND url = 'something_else'
GROUP BY user_agent
ORDER BY ucount DESC
LIMIT 3
) UNION ALL (
...
);
Now we can issue a single query to get all of the results, which will really help page load times.
Polymorphic Association UNIONs
Suppose you are using Ruby on Rails and you want to use Polymorphic Associations, but because you are database weenie, you don't want to give up your foreign key constraints.
Let's use the example from the link above. We have Addresses, that are shared between Users and Orders. Normally, we'd do something like
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name text
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
order_number text
);
CREATE TABLE addresses (
id SERIAL PRIMARY KEY,
city text,
country text,
addressable_id int,
addressable_type text
);
NOTE: Actually, I'd include UNIQUE constraints, use singular table names and include the table name in the primary key (e.g. user_id), but for simplicity, I'll do it like DHH.
Unfortunately, if something goes wrong (maybe with the DBA doing a fat finger), we can delete a row from users that still has a reference in addresses since the database can't enforce foreign key constraints. That might make things blow up.
So now, let's try this with a UNION. We'll leave users and orders well enough alone.
CREATE TABLE user_addresses (
id SERIAL PRIMARY KEY,
city text,
country text,
user_id int REFERENCES users(id)
);
CREATE TABLE order_addresses (
id SERIAL PRIMARY KEY,
city text,
country text,
order_id int REFERENCES orders(id)
);
CREATE VIEW addresses AS
SELECT id || ':user' AS id, city, country, 'User' AS addressable_type, user_id AS addressable_id
FROM user_addresses
UNION ALL
SELECT id || ':order' AS id, city, country, 'Order' AS addressable_type, order_id AS addressable_id
FROM order_addresses;
I should point out a couple of things.
- This will only work for read only access. You'd need to add a trigger
that would redirect INSERTs and UPDATEs to the appropriate table. Left as an exercise for the
reader.
- The
id field in the VIEW is not an integer. In my experience, Rails handles that just fine for
read only access, but chokes when adding a row. You'd have to define a custom primary key.
Left as an exercise for the reader.
I've used this last one to good effect because we have some data loading that occurs outside of rails, but I need to display the data on our web page. We get to keep foreign key constraints, and I get to use the power of rails. Double Win!