Thursday, June 28, 2012

De-duplication and Normalisation - A Practical Example



My post today is about getting a distinct set of values from a table when you also want to pull back other column data which may or may not itself be distinct, and about normalising your data into a better table design so the problem doesn't arise in the first place.
Let's say, for example, you have the following table structure:

CREATE TABLE dbo.library (
         LibraryID INT IDENTITY(100,5) PRIMARY KEY NOT NULL,
         ISBN VARCHAR(12),
         Publisher VARCHAR(50),
         BookTitle VARCHAR(100),
         Author VARCHAR(100) ,
         YearPublished INT )

Let's populate the table using Red Gate SQL Data Generator, 100,000 rows, all unique values except YearPublished.


Now we need to set the table up with some deliberate duplication.  In a real production DB you might find the same book entered multiple times with different LibraryID values.  To simulate this I will run the following query:

SELECT TOP 20 PERCENT * INTO #tempLibraryTable FROM dbo.library;
INSERT INTO dbo.library (ISBN, Publisher, BookTitle, Author, YearPublished)
SELECT ISBN, Publisher, BookTitle, Author, YearPublished
FROM #tempLibraryTable;
DROP TABLE #tempLibraryTable;

We now have the library table populated with 20% duplication: 20,000 of the 100,000 books are entered twice, with different LibraryIDs, giving 120,000 rows in total.


The question is: how best do we select a distinct set of values from this table, based on just one column?  Or, in plain English, can we get a distinct list of book titles (i.e. devoid of duplicates) while still including information from columns that have more than one value for our key column (the duplicate LibraryIDs, for example)?  Because this is a one-to-many relationship, we will only be able to keep one LibraryID value for each duplicated book - any others must be discarded, otherwise the query becomes a list including duplicates, not excluding them.


At first glance we might reach for SELECT DISTINCT.  However, SELECT DISTINCT is inadequate here, since the DISTINCT keyword applies across the whole row, not just one column.  Doing a SELECT DISTINCT on the dataset above returns all 120,000 rows, because the LibraryID (a unique identity int) differs on every row and so makes every row 'unique' as a whole.
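
To see the scale of the problem, a couple of quick counts will do (a sketch; the figures assume the 20% duplication set up above):

SELECT COUNT(*) AS TotalRows FROM (SELECT DISTINCT * FROM dbo.library) AS d;  -- 120,000: every row is 'distinct' thanks to LibraryID
SELECT COUNT(DISTINCT BookTitle) AS DistinctTitles FROM dbo.library;          -- 100,000: the real number of books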


Here's my first solution, stripping away the LibraryID column to leave a heap, de-duplicating the heap, and re-inserting the data into the original table:

-- Copy the data without the LibraryID column into a holding table
SELECT ISBN, Publisher, BookTitle, Author, YearPublished
INTO #tempHoldingTable
FROM dbo.Library;

SELECT * INTO dbo.Library_Backup FROM dbo.Library;

TRUNCATE TABLE dbo.Library;

-- Re-insert the de-duplicated rows; new IDENTITY values are generated (see concern 2 below)
INSERT INTO dbo.Library (ISBN, Publisher, BookTitle, Author, YearPublished)
SELECT DISTINCT ISBN, Publisher, BookTitle, Author, YearPublished
FROM #tempHoldingTable;

DROP TABLE #tempHoldingTable;

What's wrong with this solution?  Nothing, in the sense that it achieves the required aim; however, there are three concerns:


 1) Efficiency - This query writes out 120,000 rows to tempdb, then 100,000 rows back again.
 2) New LibraryIDs - the original LibraryIDs are not preserved, new IDENTITY values are assigned.
 3) Uniqueness on other columns - this solution will fail if other column data varies for one key, not just id.


Issue 1) is going to be a problem if this is a regular event, or the number of rows is large (millions), or tempdb is space-constrained or on an inefficient volume (RAID 5).  Issue 2) is more serious, since it's quite feasible that LibraryID is a foreign key for some other table.  Breaking this constraint and changing the key-value pair of ID/BookTitle would destroy referential integrity.  Issue 3) complicates matters if, e.g. we have multiple rows showing multiple YearPublished for one BookTitle, or multiple ISBNs.


What if we simply selected one row from each subset of duplicate rows?  For example, here are three rows for the same book:


LibraryID  ISBN       Publisher      BookTitle               Author      YearPublished
125        143223534  Little, Brown  The Complete Pie Guide  J. Hoffman  1999
160        143223534  Little, Brown  The Complete Pie Guide  J. Hoffman  1999
255        143223534  Little, Brown  The Complete Pie Guide  J. Hoffman  1999

We could do:

SELECT MIN(lib.libraryID) AS LibraryID, lib.ISBN, lib.Publisher, lib.BookTitle, lib.Author, lib.YearPublished
FROM dbo.Library lib INNER JOIN (
SELECT DISTINCT BookTitle FROM dbo.Library ) AS books
ON lib.BookTitle = books.BookTitle
GROUP BY lib.ISBN, lib.Publisher, lib.BookTitle, lib.Author, lib.YearPublished

This is a bit better.  Although we lose potentially valuable information where columns other than LibraryID contain varying data, we achieve our aim of streamlining the data set.

CAUTION:  This is NOT a good idea if you need to keep all of the data!  For example, using the solution above will remove all duplicate records where the book title is repeated, taking the lowest LibraryID as the record to keep.  This is not what you want if you need to record a row for e.g. multiple imprints of the same book (multiple YearPublished, ISBN).
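
Before de-duplicating on BookTitle alone, it's worth checking whether any title genuinely carries more than one ISBN or publication year.  A quick check (a sketch using the columns above):

SELECT BookTitle
FROM dbo.Library
GROUP BY BookTitle
HAVING COUNT(DISTINCT ISBN) > 1 OR COUNT(DISTINCT YearPublished) > 1;  -- titles that would lose real information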

Another way, using a non-recursive CTE:

;WITH getBooks(BookTitle) AS
( SELECT BookTitle
 FROM dbo.Library
)
SELECT  MIN(libraryID), lib.ISBN, lib.Publisher, lib.BookTitle,
  lib.Author, lib.YearPublished
FROM  dbo.Library lib INNER JOIN getBooks books
ON  lib.BookTitle = books.BookTitle
GROUP BY lib.ISBN, lib.Publisher, lib.BookTitle, lib.Author, lib.YearPublished

This achieves the same result, 100,000 rows.  It's worth noting that you can substitute MAX for MIN if you want, for example, the latest entry from each set of duplicates.  You can also swap the key column (in this case, BookTitle) for any other: instead of collapsing repeated book titles, you may wish to collapse repeated ISBNs.  You can even key on multiple columns if you want to strip out only those rows where the book title AND ISBN are both duplicated, as sketched below.
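
For example, keying on both BookTitle and ISBN only requires widening the CTE and the join condition (a sketch following the same pattern as above):

;WITH getBooks(BookTitle, ISBN) AS
( SELECT BookTitle, ISBN
 FROM dbo.Library
)
SELECT  MIN(lib.libraryID) AS LibraryID, lib.ISBN, lib.Publisher, lib.BookTitle,
  lib.Author, lib.YearPublished
FROM  dbo.Library lib INNER JOIN getBooks books
ON  lib.BookTitle = books.BookTitle
AND  lib.ISBN = books.ISBN
GROUP BY lib.ISBN, lib.Publisher, lib.BookTitle, lib.Author, lib.YearPublished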

Ideally of course we would normalise our data and in the process of doing so, remove duplicates (referential integrity and foreign key constraints simply wouldn't allow them).

The first step on the road to normalisation is to work out the requirements, and the relationships between the data.  I like to do this by thinking about the nature of the data itself, and not just the attributes of it (like length, datatype).

It is clear that in our example libraryID should be unique.  There should be one and only one ISBN per book, and each ISBN is associated with just one book and no more (one-to-one relationship).  There is only one publisher per book, but a publisher can have multiple books (one-to-many relationship).  A book can have one title only, but the same title might apply to multiple books (one-to-many).  There can be multiple authors for a book, and an author can have multiple books (many-to-many).  Finally, a book can be published in only one year (single edition - subsequent editions will vary on ISBN, perhaps publisher) and in one year, multiple books can be published (one-to-many relationship).

It is clear ISBN is a natural primary key, there being only one ISBN per book entity and one book entity per ISBN.  So let's bin the surrogate primary key LibraryID and use ISBN instead.  (For the purposes of this experiment, every book has an ISBN with no NULL values, and ISBNs are integer numbers - no leading zeroes).  By normalising a little, we can add additional information about each column for each book while actually reducing the amount of information in the central table.  Imagine we had this structure (included FK information for easy reference):

-- Parent tables first, so that the FK references below can resolve
CREATE TABLE dbo.Publishers (
PubID  INT PRIMARY KEY NOT NULL,
PubName  VARCHAR(MAX) NOT NULL,
YearEstablished INT NULL,
StreetAddr VARCHAR(MAX) NULL,
logo  VARBINARY(MAX) NULL,
PhoneNumber VARCHAR(50) NULL )

CREATE TABLE dbo.Authors (
AuthorID INT PRIMARY KEY NOT NULL,
AuthorName VARCHAR(MAX) NOT NULL,
Gender  BIT NULL,
DateOfBirth DATETIME NULL )

CREATE TABLE dbo.library (
ISBN   INT PRIMARY KEY NOT NULL,
PubID  INT FOREIGN KEY REFERENCES dbo.Publishers(PubID) NULL,
BookTitle VARCHAR(MAX) NOT NULL,
AuthorID INT FOREIGN KEY REFERENCES dbo.Authors(AuthorID) NOT NULL,
YearPublished INT NULL )

However, we've immediately got a problem.  What if a book has many authors?  We cannot add in another row for the book with the same ISBN since our primary key constraint prohibits this.  If we list the author as an amalgamation of authors (John Smith and Eddy Jones) we break the first rule of first normal form (1NF).  One solution would be to have a table called 'AuthorGroups' which would list each known combination of authors, pivoting the author names into separate columns, for a maximum of, let's say, five authors:

CREATE TABLE dbo.AuthorGroups (
GroupID  INT PRIMARY KEY NOT NULL,
AuthorOne VARCHAR(100) NOT NULL,
AuthorTwo VARCHAR(100) NULL,
AuthorThree VARCHAR(100) NULL,
AuthorFour VARCHAR(100) NULL,
AuthorFive VARCHAR(100) NULL )

We then amend the dbo.Authors table:

ALTER TABLE dbo.Authors
ALTER COLUMN AuthorName VARCHAR(MAX) NULL;
ALTER TABLE dbo.Authors
ADD IsGroup BIT NOT NULL DEFAULT 0;  -- a default avoids errors if the table already has rows
ALTER TABLE dbo.Authors
ADD GroupID INT FOREIGN KEY REFERENCES dbo.AuthorGroups(GroupID) NULL;

However we then have a lot of wasted NULL values, since in dbo.Authors every AuthorID that is not a singleton (i.e. a group) will have NULL values for AuthorName, Gender and DateOfBirth.  We are also unable to capture that latter information for groups of authors.  We also have duplication if authors are put into the AuthorGroups table where they exist in Authors, and potential duplication since the rows in AuthorGroups will be unique even if they have the same authors in a different column order.  Also, of course, we're limited to just five authors.  On a technical reference manual there are often many authors (SQL Server MVP Deep Dives Volume II, to take a random example, has 7 editors and 61 authors).
Let's approach it a different way.  We'll create the following table instead:

CREATE TABLE dbo.AuthorGroups (
RowID INT IDENTITY PRIMARY KEY NOT NULL,
GroupID INT NOT NULL,
AuthorID INT FOREIGN KEY REFERENCES dbo.Authors(AuthorID) NOT NULL )

And we'll amend dbo.Library by dropping and recreating it as:

CREATE TABLE dbo.library (
ISBN   INT PRIMARY KEY NOT NULL,
PubID  INT FOREIGN KEY REFERENCES dbo.Publishers(PubID) NULL,
BookTitle VARCHAR(MAX) NOT NULL,
AuthorID INT NOT NULL, -- logically references dbo.AuthorGroups(GroupID); GroupID is not a unique key there, so a declared FK is not possible
YearPublished INT NULL )

This is better, since the AuthorGroups table can contain singletons (one GroupID mapped to one AuthorID) for individuals, and have as many author entries as required for each GroupID, which is referenced in our central dbo.Library table.
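
For instance (illustrative IDs only), one singleton author and one three-author group might be stored as:

GroupID  AuthorID
1        17        (a 'group' of one)
2        4
2        9
2        12        (three authors sharing GroupID 2)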

In terms of migrating the data from our old single table to our new schema, some work is still required.  We must:

1) Identify instances where the same ISBN appears more than once in the original dbo.library table and de-duplicate those rows.
2) Select every distinct publisher and put them into dbo.Publishers.
3) Select every distinct author, string-splitting concatenated values, and put them into dbo.Authors.
4) Identify every concatenated author value and create an entry in dbo.AuthorGroups for it, cross-referencing dbo.Authors.
5) Migrate the bulk of the data into dbo.Library.

Let's take these tasks one at a time.  We can do 1), since this is just a rehash of our CTE above, using ISBN instead of BookTitle.  However let's change it slightly to have multiple conditions on the inner join so that we can avoid picking up rows with incomplete data.  We will need to change the CTE to pick up multiple columns which we are certain are not duplicates for the key ISBN value.  Below, we're using BookTitle for this, since it's a reasonable assumption that there will not be multiple titles for one ISBN nor multiple ISBNs for one title.  This will also prevent us saving rows which have an ISBN with NULL values for the rest of the columns (accidental data entry, perhaps, might cause this). 

Be aware that if you pick columns which do, in fact, have different values for the same key ISBN value, you'll get an error later when attempting to insert the ISBN data into the new, primary-key-constrained table.

;WITH getISBNs(ISBN, BookTitle) AS
( SELECT ISBN, BookTitle
FROM dbo.Library
)
SELECT  MIN(lib.libraryID) AS LibraryID, lib.ISBN, lib.Publisher, lib.BookTitle,
lib.Author, lib.YearPublished
INTO   dbo.Library_replacement
FROM  dbo.Library lib INNER JOIN getISBNs isbns
ON  lib.ISBN = isbns.ISBN
AND  lib.BookTitle = isbns.BookTitle
GROUP BY lib.ISBN, lib.Publisher, lib.BookTitle, lib.Author, lib.YearPublished
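
A quick sanity check (a sketch): after this runs, no ISBN should appear more than once in the replacement table.

SELECT ISBN, COUNT(*) AS Occurrences
FROM dbo.Library_replacement
GROUP BY ISBN
HAVING COUNT(*) > 1;  -- should return zero rows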

Task 2) is easy too, with a simple SELECT DISTINCT.  However, let's modify the Publishers table so that PubID is an IDENTITY column; that makes the inserts straightforward and guarantees a unique value.
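
We defined PubID as a plain INT primary key earlier; the amended definition might look like this (a sketch - only the PubID line changes, and the earlier, empty dbo.Publishers would need dropping first):

CREATE TABLE dbo.Publishers (
PubID  INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
PubName  VARCHAR(MAX) NOT NULL,
YearEstablished INT NULL,
StreetAddr VARCHAR(MAX) NULL,
logo  VARBINARY(MAX) NULL,
PhoneNumber VARCHAR(50) NULL )

With that in place, populating it is a single INSERT: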

INSERT INTO dbo.Publishers (PubName)
SELECT DISTINCT Publisher FROM dbo.Library;

For Task 3), we'll again modify the table definition, this time so that AuthorID is an IDENTITY column.
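
The amended dbo.Authors might look like this (again a sketch - only the AuthorID line changes):

CREATE TABLE dbo.Authors (
AuthorID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
AuthorName VARCHAR(MAX) NOT NULL,
Gender  BIT NULL,
DateOfBirth DATETIME NULL )

The harder problem is that we have to string split the Author column on the comma delimiter.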
 
The logic of the split is as follows:

For each 'author' value in each row returned by SELECT DISTINCT Author FROM dbo.Library:

Determine if it is a composite value (multiple authors).

 If YES -
   a) determine how many elements (authors) are in the composite value.
   b) insert one new row into dbo.Authors for each element.

 If NO -
   a) insert the author into the dbo.Authors table as-is.


However, writing an efficient string splitter is rather tricky.  Here, I'm going to use the following tools: CHARINDEX, which searches a string or expression for another string or expression and returns the starting position if found; SUBSTRING, which returns part of a string given a starting position and length; and DATALENGTH, which returns the number of bytes in a given expression.  I'm also going to borrow heavily from Jeff Moden's string splitter code, available here: http://www.sqlservercentral.com/articles/Tally+Table/72993/ which uses a tally (numbers) table to efficiently cleave strings by delimiter.

First, I'll extract the author information:

SELECT DISTINCT author INTO #allSingletonAuthors FROM dbo.Library
WHERE author NOT LIKE ('%,%');  -- this will build a list of all single authors.
SELECT DISTINCT author INTO #allMultipleAuthors FROM dbo.Library
WHERE author LIKE ('%,%'); -- this will build a list of all multiple authors.

Now let's use Jeff's string splitter function.  First, we have to define it:

CREATE FUNCTION dbo.DelimitedSplit8K
 --===== Define I/O parameters
         (@pString VARCHAR(8000), @pDelimiter CHAR(1))
 RETURNS TABLE WITH SCHEMABINDING AS
  RETURN
 --===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
      -- enough to cover VARCHAR(8000)
   WITH E1(N) AS (
                  SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                  SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                  SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                 ),                          --10E+1 or 10 rows
        E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
        E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
  cteTally(N) AS (--==== This provides the "zero base" and limits the number of rows right up front
                      -- for both a performance gain and prevention of accidental "overruns"
                  SELECT 0 UNION ALL
                  SELECT TOP (DATALENGTH(ISNULL(@pString,1))) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                 ),
 cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
                  SELECT t.N+1
                    FROM cteTally t
                   WHERE (SUBSTRING(@pString,t.N,1) = @pDelimiter OR t.N = 0)
                 )
 --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
  SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY s.N1),
         Item       = SUBSTRING(@pString,s.N1,ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000))
    FROM cteStart s
 ;
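
A quick sanity check on a made-up author list shows what the function returns:

SELECT ItemNumber, Item
FROM dbo.DelimitedSplit8K('J. Hoffman, E. Jones, A. Smith', ',');
-- ItemNumber 1 = 'J. Hoffman', 2 = ' E. Jones', 3 = ' A. Smith' (note the leading spaces)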

Now let's call the function, passing in the appropriate value from #allMultipleAuthors:

INSERT INTO dbo.Authors (AuthorName)
SELECT split.Item
FROM #allMultipleAuthors ama
CROSS APPLY dbo.DelimitedSplit8K(ama.author,',') split
ORDER BY split.Item ASC

Thanks to the string splitter, each of the delimited values will be placed into the dbo.Authors table.  A note of warning: if the concatenated field includes spaces after the delimiter, those spaces will be carried into the split values.  To fix this, trim the string before insertion, as below.
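
For example, the same insert with trimming applied (a sketch):

INSERT INTO dbo.Authors (AuthorName)
SELECT LTRIM(RTRIM(split.Item))
FROM #allMultipleAuthors ama
CROSS APPLY dbo.DelimitedSplit8K(ama.author,',') split;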

Task 4) can be accomplished in much the same way.  dbo.AuthorGroups has the RowID surrogate primary key which is an IDENTITY column.  We'll call the string splitter again.  But this time we're going to add in the GroupID and AuthorID columns using the RANK and DENSE_RANK functions as follows:

SELECT
DENSE_RANK() OVER (ORDER BY ama.author) AS GroupID,
RANK() OVER (ORDER BY ama.author, split.Item) AS AuthorID,
ama.author, split.Item
FROM #allMultipleAuthors ama
CROSS APPLY dbo.DelimitedSplit8K(ama.author,',') split

This gives a result set with a gapless GroupID corresponding to each distinct multi-author string in the original dbo.Library, and a sequential AuthorID for each author within each group, preserving the correlation between group and individual.  (The same author appearing in two different groups will get two AuthorIDs at this stage; the de-duplication step further down resolves that.)
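
For illustration (hypothetical author strings, and assuming the split items have been trimmed), two composite values might produce:

GroupID  AuthorID  author                Item
1        1         E. Jones, J. Hoffman  E. Jones
1        2         E. Jones, J. Hoffman  J. Hoffman
2        3         K. Lee, M. Patel      K. Lee
2        4         K. Lee, M. Patel      M. Patel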
Putting this into some final code to insert the values (note that this supersedes the simple inserts shown for Task 3; if both are run, the explicit AuthorID values will collide with IDENTITY values already assigned):

SET IDENTITY_INSERT dbo.Authors ON;
INSERT INTO dbo.Authors (AuthorID, AuthorName, Gender, DateOfBirth)
SELECT  RANK() OVER (ORDER BY ama.author, split.Item) AS AuthorID,
split.Item, NULL, NULL
FROM #allMultipleAuthors ama CROSS APPLY dbo.DelimitedSplit8K(ama.author,',') split;
SET IDENTITY_INSERT dbo.Authors OFF;


INSERT INTO dbo.AuthorGroups (GroupID, AuthorID)
SELECT
DENSE_RANK() OVER (ORDER BY ama.author) AS GroupID,
RANK() OVER (ORDER BY ama.author, split.Item) AS AuthorID
FROM #allMultipleAuthors ama
CROSS APPLY dbo.DelimitedSplit8K(ama.author,',') split

Let's dispense with the singleton authors by loading them into our dbo.Authors table too:

INSERT INTO dbo.Authors (AuthorName, Gender, DateOfBirth)
SELECT author, NULL, NULL FROM #allSingletonAuthors;
DROP TABLE #allSingletonAuthors;

Now at this point we have duplicates in our Authors table, because there are two data sources for the authors: the split multi-author rows and the singleton rows.  So we'll de-duplicate this data using much the same method as at the beginning of this article.  The one wrinkle is ordering: we must re-point the AuthorGroups.AuthorID column at the surviving author IDs before we throw the duplicate rows away, otherwise the mapping is lost.

;WITH theAuthors(AuthorName) AS
( SELECT AuthorName
 FROM dbo.Authors
)
SELECT  MAX(au.AuthorID) AS AuthorID, au.AuthorName, au.Gender, au.DateOfBirth
INTO  #tempAuthorInfo
FROM  dbo.Authors au INNER JOIN theAuthors theau
ON  au.AuthorName = theau.AuthorName
GROUP BY au.AuthorName, au.Gender, au.DateOfBirth;

-- Re-point AuthorGroups at the surviving (kept) AuthorID for each name
UPDATE ag
SET  ag.AuthorID = keep.AuthorID
FROM dbo.AuthorGroups ag
INNER JOIN dbo.Authors au ON ag.AuthorID = au.AuthorID
INNER JOIN #tempAuthorInfo keep ON keep.AuthorName = au.AuthorName;

-- Now remove the duplicate author rows that are no longer referenced
DELETE au
FROM dbo.Authors au
WHERE NOT EXISTS (SELECT 1 FROM #tempAuthorInfo keep WHERE keep.AuthorID = au.AuthorID);
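
One extra step that the walk-through above doesn't spell out, but which the final migration needs: every singleton author should also get a one-row 'group' in dbo.AuthorGroups (as described earlier), so that single-author books have a GroupID to reference.  A possible sketch, offsetting the new GroupIDs past those already assigned:

INSERT INTO dbo.AuthorGroups (GroupID, AuthorID)
SELECT ISNULL((SELECT MAX(GroupID) FROM dbo.AuthorGroups), 0)
       + ROW_NUMBER() OVER (ORDER BY au.AuthorID) AS GroupID,
       au.AuthorID
FROM dbo.Authors au
WHERE NOT EXISTS (SELECT 1 FROM dbo.AuthorGroups ag WHERE ag.AuthorID = au.AuthorID);  -- authors not already in any group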

Task 5) is the remaining data to migrate.  If you'll recall, our original table structure looked like this:

CREATE TABLE dbo.library (
LibraryID INT IDENTITY(100,5) PRIMARY KEY NOT NULL,
ISBN VARCHAR(12),
Publisher VARCHAR(50),
BookTitle VARCHAR(100),
Author VARCHAR(100) ,
YearPublished INT )

We've populated our dbo.Publishers, dbo.Authors and dbo.AuthorGroups tables.  So now we have to populate our new main table, which looks like this:

CREATE TABLE dbo.library_new (
ISBN   INT PRIMARY KEY NOT NULL,
PubID  INT FOREIGN KEY REFERENCES dbo.Publishers(PubID) NULL,
BookTitle VARCHAR(MAX) NOT NULL,
AuthorID INT NOT NULL, -- logically references dbo.AuthorGroups(GroupID); GroupID is not a unique key there, so a declared FK is not possible
YearPublished INT NULL )

Let's do:

INSERT INTO dbo.library_new (ISBN, PubID, BookTitle, AuthorID, YearPublished)
SELECT old.ISBN, pub.PubID, old.BookTitle, auth.GroupID, old.YearPublished
FROM dbo.Library_replacement old  -- the de-duplicated set from Task 1, so the ISBN primary key isn't violated
INNER JOIN dbo.Publishers pub ON old.Publisher = pub.PubName
INNER JOIN dbo.Authors au ON au.AuthorName = old.Author  -- resolves single-author strings; concatenated multi-author strings need mapping back to their GroupID separately
INNER JOIN dbo.AuthorGroups auth ON auth.AuthorID = au.AuthorID
ORDER BY old.ISBN

This populates the data into the new schema.  At this point, for optimisation we might choose to create new non-clustered indexes (clustered indexes on all the primary keys already exist) and write our stored procedure interface for application use.  However the new schema is ready to go and we can archive and drop the old table.
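
As a final sanity check (a sketch), compare row counts between the de-duplicated source and the new table; any shortfall in the new table will be rows whose publisher or author string didn't resolve through the INNER JOINs above:

SELECT (SELECT COUNT(*) FROM dbo.Library_replacement) AS SourceRows,
       (SELECT COUNT(*) FROM dbo.library_new) AS MigratedRows;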

An ERD of the new schema and the complete uninterrupted code of the example given in this article are available for download at http://www.derekcolley.co.uk/.

Friday, June 15, 2012

New Website Launched!


NEWS:  http://www.derekcolley.co.uk/ has been launched!  Providing SQL Server database consultancy and DBA expert support in Manchester, the North West and the UK, I'm pleased to provide specialist help for your bespoke problems.  Need to move from Access to SQL?  Upgrade from SQL Server 6/7 to 2012?  Need help with a complex set of schemas?  Contact me today for a free consultation and a friendly chat about your needs.

Having worked with a variety of Microsoft database software including Access, SQL Server 2000, 2005, 2008, 2008 R2 and now 2012 for over 10 years, I'm ideally placed to assist you.  Working as a full-time DBA in my own right, I continue to develop my skills in the real world and I'm available on an hourly rate or fixed-price basis to help you get your business back on track.

Contact me through http://www.derekcolley.co.uk/, directly at derek@derekcolley.co.uk or catch up with me on Twitter - #dcolleySQL or LinkedIn - http://uk.linkedin.com/pub/derek-colley/38/30a/aa7.