Monday, 15 October 2012

Fisher-Yates (Knuth) Shuffle in PL/SQL

We were recently looking for a way to easily anonymize some customer data, specifically bank account numbers.

Part of our testing mechanism is to verify that a bank account and sort-code combination is valid and we retrieve the bank details for that combination (using a commercial account lookup service).  So we couldn't just replace the sort-code and/or account number with random numbers.

Conversely, replacing all account numbers with a handful of known, "safe" working values would lead to skewed data spreads, so that wouldn't work.

A better solution is to shuffle the existing data, moving the sort-code and account number pairs to a different, unrelated client record.

We found a description of the Fisher-Yates Shuffle (one variety of which is also known as the Knuth Shuffle) which seemed like a good way to achieve our aim.  It seems to be used quite widely to do exactly what we wanted, but I couldn't find any examples of implementing the shuffle in PL/SQL.

So, here's our solution:

 
set serveroutput on size unlimited
DECLARE

    jswap NUMBER ;
    jint NUMBER := 17 ;
    lcount NUMBER := 0 ;
    lrowcount NUMBER := 0 ;

    TYPE tab_ba     IS TABLE OF client_bank_accts%ROWTYPE INDEX BY BINARY_INTEGER;
    array_ba        tab_ba ;

    CURSOR r IS
    SELECT *
    FROM client_bank_accts;

    sav_banksortcd client_bank_accts.banksortcd%TYPE;
    sav_bankacctnbr client_bank_accts.bankacctnbr%TYPE;

BEGIN

   OPEN r;
   LOOP

      FETCH r 
      BULK COLLECT 
      INTO array_ba LIMIT 5000;

      lrowcount := array_ba.COUNT;

      FOR i IN REVERSE 1 .. array_ba.COUNT LOOP

         jswap := trunc(dbms_random.value(1, i+1));

         sav_banksortcd := array_ba(i).banksortcd;
         sav_bankacctnbr := array_ba(i).bankacctnbr;

         UPDATE client_bank_accts 
         SET    banksortcd = array_ba(jswap).banksortcd,
                bankacctnbr = array_ba(jswap).bankacctnbr
         WHERE  clientbankacctseq = array_ba(i).clientbankacctseq;

         UPDATE client_bank_accts 
         SET    banksortcd = sav_banksortcd,
                bankacctnbr = sav_bankacctnbr
         WHERE  clientbankacctseq = array_ba(jswap).clientbankacctseq;

         lcount := lcount + 2;

      END LOOP;
      EXIT WHEN r%NOTFOUND;

   COMMIT;
   END LOOP;
   CLOSE r;

   dbms_output.put_line('Completed. '||to_char(lcount)||' rows updated!');

END;
/

No comments:

Post a Comment