Don't Believe Acronyms!: Fisher-Yates (Knuth) Shuffle in PL/SQL

Monday, 15 October 2012

Fisher-Yates (Knuth) Shuffle in PL/SQL

We were recently looking for a way to easily anonymize some customer data, specifically bank account numbers.

Part of our testing mechanism is to verify that a bank account and sort-code combination is valid and we retrieve the bank details for that combination (using a commercial account lookup service). So we couldn't just replace the sort-code and/or account number with random numbers.

Conversely, replacing all account numbers with a handful of known, "safe" working values would lead to skewed data spreads, so that wouldn't work.

A better solution is to shuffle the existing data, moving the sort-code and account number pairs to a different, unrelated client record.

We found a description of the Fisher-Yates Shuffle (one variety of which is also known as the Knuth Shuffle) which seemed like a good way to achieve our aim. It seems to be used quite widely to do exactly what we wanted, but I couldn't find any examples of implementing the shuffle in PL/SQL.

So, here's our solution:

 
set serveroutput on size unlimited
DECLARE

    jswap NUMBER ;
    jint NUMBER := 17 ;
    lcount NUMBER := 0 ;
    lrowcount NUMBER := 0 ;

    TYPE tab_ba     IS TABLE OF client_bank_accts%ROWTYPE INDEX BY BINARY_INTEGER;
    array_ba        tab_ba ;

    CURSOR r IS
    SELECT *
    FROM client_bank_accts;

    sav_banksortcd client_bank_accts.banksortcd%TYPE;
    sav_bankacctnbr client_bank_accts.bankacctnbr%TYPE;

BEGIN

   OPEN r;
   LOOP

      FETCH r 
      BULK COLLECT 
      INTO array_ba LIMIT 5000;

      lrowcount := array_ba.COUNT;

      FOR i IN REVERSE 1 .. array_ba.COUNT LOOP

         jswap := trunc(dbms_random.value(1, i+1));

         sav_banksortcd := array_ba(i).banksortcd;
         sav_bankacctnbr := array_ba(i).bankacctnbr;

         UPDATE client_bank_accts 
         SET    banksortcd = array_ba(jswap).banksortcd,
                bankacctnbr = array_ba(jswap).bankacctnbr
         WHERE  clientbankacctseq = array_ba(i).clientbankacctseq;

         UPDATE client_bank_accts 
         SET    banksortcd = sav_banksortcd,
                bankacctnbr = sav_bankacctnbr
         WHERE  clientbankacctseq = array_ba(jswap).clientbankacctseq;

         lcount := lcount + 2;

      END LOOP;
      EXIT WHEN r%NOTFOUND;

   COMMIT;
   END LOOP;
   CLOSE r;

   dbms_output.put_line('Completed. '||to_char(lcount)||' rows updated!');

END;
/

No comments:

Post a Comment

Oracle & Me

I've been working with Oracle since 1988. Starting as a tape monkey, my first job in IT was as part of a two-man team building an international tour operator's reservation and billing system from the ground up on an early Sequent box. In retrospect, that sounds a lot more difficult and impressive than it seemed at the time.

I spent a couple of years at a logistics firm as a Forms2.3 developer, but I soon got 'promoted' to DBA - the money was better and the job was easier, even though the hours were much longer, in those days of overnight rebuilds and defrags.

Since then I've been permy, contract, permy, contract and finally permanent again.

I am now working as a production and development DBA (and Weblogic admin) at a healthcare insurance provider in the south of England.

My areas of expertise are SQL tuning, RMAN, troubleshooting and moaning about users and/or developers.