Friday, March 4, 2016

SQL: Counting Groups of Rows Sharing Common Column Values

In this post, I focus on using simple SQL SELECT statements to count the number of rows in a table meeting a particular condition with the results grouped by a certain column of the table. These are all basic SQL concepts, but mixing them allows for different and useful representations of data stored in a relational database. The specific aspects of a SQL query covered in this post and illustrated with simple examples are the aggregate function count(), WHERE, GROUP BY, and HAVING. These will be used to build together a simple single SQL query that indicates the number of rows in a table that match different values for a given column in that table.

I'll need some simple SQL data to demonstrate. The following SQL code demonstrates creation of a table called ALBUMS in a PostgreSQL database followed by use of INSERT statements to populate that table.

createAndPopulateAlbums.sql
CREATE TABLE albums
(
   title text,
   artist text,
   year integer
);

INSERT INTO albums (title, artist, year)
   VALUES ('Back in Black', 'AC/DC', 1980);
INSERT INTO albums (title, artist, year)
   VALUES ('Slippery When Wet', 'Bon Jovi', 1986);
INSERT INTO albums (title, artist, year)
   VALUES ('Third Stage', 'Boston', 1986);
INSERT INTO albums (title, artist, year)
   VALUES ('Hysteria', 'Def Leppard', 1987);
INSERT INTO albums (title, artist, year)
   VALUES ('Some Great Reward', 'Depeche Mode', 1984);
INSERT INTO albums (title, artist, year)
   VALUES ('Violator', 'Depeche Mode', 1990);
INSERT INTO albums (title, artist, year)
   VALUES ('Brothers in Arms', 'Dire Straits', 1985);
INSERT INTO albums (title, artist, year)
   VALUES ('Rio', 'Duran Duran', 1982);
INSERT INTO albums (title, artist, year)
   VALUES ('Hotel California', 'Eagles', 1976);
INSERT INTO albums (title, artist, year)
   VALUES ('Rumours', 'Fleetwood Mac', 1977);
INSERT INTO albums (title, artist, year)
   VALUES ('Kick', 'INXS', 1987);
INSERT INTO albums (title, artist, year)
   VALUES ('Appetite for Destruction', 'Guns N'' Roses', 1987);
INSERT INTO albums (title, artist, year)
   VALUES ('Thriller', 'Michael Jackson', 1982);
INSERT INTO albums (title, artist, year)
   VALUES ('Welcome to the Real World', 'Mr. Mister', 1985);
INSERT INTO albums (title, artist, year)
   VALUES ('Never Mind', 'Nirvana', 1991);
INSERT INTO albums (title, artist, year)
   VALUES ('Please', 'Pet Shop Boys', 1986);
INSERT INTO albums (title, artist, year)
   VALUES ('The Dark Side of the Moon', 'Pink Floyd', 1973);
INSERT INTO albums (title, artist, year)
   VALUES ('Look Sharp!', 'Roxette', 1988);
INSERT INTO albums (title, artist, year)
   VALUES ('Songs from the Big Chair', 'Tears for Fears', 1985);
INSERT INTO albums (title, artist, year)
   VALUES ('Synchronicity', 'The Police', 1983);
INSERT INTO albums (title, artist, year)
   VALUES ('Into the Gap', 'Thompson Twins', 1984);
INSERT INTO albums (title, artist, year)
   VALUES ('The Joshua Tree', 'U2', 1987);
INSERT INTO albums (title, artist, year)
   VALUES ('1984', 'Van Halen', 1984);

The next two screen snapshots show the results of running this script in psql:

At this point, if I want to see how many albums were released in each year, I could use several individual SQL query statements like these:

SELECT count(1) FROM albums where year = 1985;
SELECT count(1) FROM albums where year = 1987;

It might be desirable to see how many albums were released in each year without needing an individual query for each year. This is where using an aggregate function like count() with a GROUP BY clause comes in handy. The next query is simple, but takes advantage of GROUP BY to display the count of each "group" of rows grouped by the albums' release years.

SELECT year, count(1)
  FROM albums
 GROUP BY year;

The WHERE clause can be used as normal to narrow the number of returned rows by specifying a narrowing condition. For example, the following query returns the albums that were released in a year after 1988.

SELECT year, count(1)
  FROM albums
 WHERE year > 1988
 GROUP BY year;

We might want to only return the years for which multiple albums (more than one) are in our table. A first naive approach might be as shown next (doesn't work as shown in the screen snapshot that follows):

-- Bad Code!: Don't do this.
SELECT year, count(1)
  FROM albums
 WHERE count(1) > 1
 GROUP BY year;

The last screen snapshot demonstrates that "aggregate functions are not allowed in WHERE." In other words, we cannot use the count() in the WHERE clause. This is where the HAVING clause is useful because HAVING narrows results in a similar manner as WHERE does, but is used with aggregate functions and GROUP BY.

The next SQL listing demonstrates using the HAVING clause to accomplish the earlier attempted task (listing years for which multiple album rows exist in the table):

SELECT year, count(1)
  FROM albums
 GROUP BY year
HAVING count(1) > 1;

Finally, I may want to order the results so that they are listed in increasing (later) years. Two of the SQL queries demonstrated earlier are shown here with ORDER BY added.

SELECT year, count(1)
  FROM albums
 GROUP BY year
 ORDER BY year;
SELECT year, count(1)
  FROM albums
 GROUP BY year
HAVING count(1) > 1
 ORDER BY year;

SQL has become a much richer language than when I first began working with it, but the basic SQL that has been available for numerous years remains effective and useful. Although the examples in this post have been demonstrated using PostgreSQL, these examples should work on most relational databases that implement ANSI SQL.

No comments: