Removing Duplicates in SQL

Removing Duplicates in SQL

Removing Duplicates from a Text File in SQL

In this blog, we explain how to Remove Duplicates and block comments in SQL. And then explore various options to Remove Optional Keywords in the SQL database.

 

You have a new project. It deals with a lot of old SQL codes that do not have proper documentation. When you check the database, you see a lot of tables—some with names like inventory_bak or orders_1.

 

You are pretty sure that a lot of these tables are not in use. You cannot find any ERD or a master table script.

 

One thing you know for sure is that you have a set of scripts (PL/SQL functions and procedures) that are in active use and that there is an exhaustive list of SQL codes (other than a table or index creation, etc.).

 

If only you could find out from these scripts which of the tables (a list of distinct table names) are actively in use. You feel that there are not too many. You could look at these tables to make a reasonably good ERD yourself (or possibly use a reverse-engineering tool).

 

For the sake of ease, you can easily concatenate all of the .sql files into one. You can work on a copy of this combined file. The main goal is to extract the (distinct) table list.

 

Solution

In this language, the line comments start with two consecutive dashes (they need not be at the beginning of the line) and the block comments are enclosed within /* and */ (delimiters included). Keywords are not case sensitive, but table names are not. Also, any table names that occur within comments are not of interest.

 

The search should focus on other points with respect to the language syntax. A query is usually of the following form.


select <column list>

from tbl_a, tbl_b,

tbl c;

or

select <column list>

from tbl_a, tbl_b,

tbl c

where some conditions;

For select statements, table names should be between the from keyword and either a semicolon (';') or the where keyword, separated by commas (when more than one), but not necessarily occurring on the same line.

 

However, the name of a single table does not span multiple lines and only has letters (in either case) and an underscore.

 

The following is one form of an update statement.


update tbl_a set

<column value assignments>

from tbl_a, tbl_b

where some conditions;

The from part is optional, but if it exists, it is similar to that of the select statement.

For insert statements, the forms are as follows.

insert into tbl_a (column list) values (value list);

or

insert into tbl_a values (value list);

The keyword into is optional.

For delete statements, the forms are as follows.

delete from tbl_a;

or

delete from tbl_a

where some conditions;

In either case, the keyword is optional.

For update and delete, the first table mentioned is a single one. insert always works on a single table. There may be subqueries within the where clause, but that too has a form part.

 

You never look at the where part for table names (but you do look at the form part, as well as insert, update, and delete statements, in the first table). However, the form part of subqueries is considered (in this case, the delimiter is either where or closing parenthesis).

 

For simplification, consider that this language has no table alias concept. Also consider that in the active code (any portion that is not commented), the use of the word from only happens in a from clause. 

 

Words may appear with one or more whitespace in between. A comma (',') in the scope may have zero or more whitespaces on either side. Whitespace includes newlines, but you may treat them separately.

 

For the sake of simplification, consider that a comma appears in the same line after a table name (it may have one or more spaces or tabs, or both).

 

And a semicolon and a closing parenthesis also appear similarly in the same line with the immediately preceding token or keyword. Also, two select, insert, delete, or update statements (or two of any combination of these keywords) do not occur on the same line in the uncommented code.

 

The Technical Problem Specification

We need to look for tokens (table names) and make a distinct list/set of them (and print that to a file). The tokens are themselves case sensitive but other keywords are not.

 

Keeping in mind the possible whitespace separation part as applicable, they have to be searched within.

  1. insert or insert into and '(' or values – single occurrence
  2. update and set – single occurrence
  3. delete or delete from and semicolon(;) or where – single occurrence
  4. from and semicolon(;) or where or closing parenthesis – single occurrence or multiple occurrences separated by commas

 

Step 1: Remove the Comments

The first step is to remove the comments and possibly create another file, which is the input for further processing. After all, we are only interested in the list of tables, not the final SQL file.

 

Removing block comments was discussed in an earlier task, so this step won’t be repeated here. For the sake of simplicity, in this step, you have a file that has all the block comments removed.

 

For removing the line comments (from the double dash (–-) until the end of the line), the following code is good enough.


infile = File.open 'inp1.sql','r'

outfile = File.open 'out.txt','w'

while line = infile.gets

outfile.print line.sub(/\-\-.*$/,'').sub(/^$\n/,'')

end

infile.close

outfile.close

The first substitution function with the /\-\-.*$/ the pattern is for removing the line comments. The second substitution, with the /^$\n/ pattern, should remove empty lines (except leaving one empty line at the end of the file in certain cases).

 

Not having any empty lines between active code parts can greatly help further searches. Going forward, the output file of this code is the input for the next step, which is devoid of any comments and contains only active code.

 

Step 2: Remove Optional Keywords

Optional keywords are from after delete and into after insert. If you come to think of it, accommodating them in a regular expression isn’t straightforward. If we get rid of them at this stage, further steps will be smoother.

 

To develop this step, you can use the following in a file as simple input data.


delete FRom abc

another line

DELete def

delete

from ghi

INSert into abc1

insert def1

still another line

insert

INTO ghi1

Take the case of delete and from (insert and into are similar except the words are different).

 

First, consider the single-line case(s) and then take a case where the next line may have the optional keyword.

\1.\ If delete is followed with one or more whitespaces and then the word from in the same line, we have a match. Keep in mind that they have to be complete words and case should be ignored.

 

The /\bdelete\b[\s]+?\bfrom\b/i pattern should work.

\2.\ If the preceding pattern does not match and the line still has the delete keyword, then it is possible that there is no word after deleting in that line. Till the end of the line, after delete, it is whitespaces (or no character). In that case, if the first word in the next line is from, then we have a match.

 

In the first line, the /\bdelete\b[\s]*$/i pattern should match.

And the very next line (as there should be no empty line in the input file, except possibly in the end) should have the /^[\s]*\ bfrom\b[\s]*$/i pattern.

In absence of which the match isn’t available on that occasion.

 

On this occasion, we can operate on a string by joining two lines (joining two lines in a normal SQL file should not result in a huge string) and treating them as a single line for pattern replacement. (Consider it another way of doing what has been done for the 'create procedure' case, using $.).

 

The following code will work for a two-line (delete from) case, but it won’t work for single-line cases, which are simpler and can be augmented upon easily.


infile = File.open 'inp2.sql','r'

outfile = File.open 'out.txt','w'

while line = infile.gets

if line.match(/\bdelete\b[\s]*$/i)

firstlinefound = true

joinedlines = line

next #don't do any more processing for this line

end

if firstlinefound #for the very next line firstlinefound = false # reset it if line.match(/^[\s]*\bfrom\b/i)

#join this line to the last one

joinedlines = joinedlines + line

#treat the joined line as a regular line and replace

joinedlines = joinedlines.sub(/\bdelete\b[\s]+?\

bfrom\b/i ,'delete')

outfile.print joinedlines

joinedlines = '' #reset joinedlines

next #no need to go further down for this line else #the overall match for delete from failed in this case

#print the last line here

#the current line will get printed towards the end

of while loop anyway

outfile.print joinedlines

joinedlines = '' #reset joinedlines

end

end

outfile.print line

end

infile.close

outfile.close

After removing some comments and putting it in function form (in such a way that it can be used for the insert into the case also), and augmenting it with single line cases, the code is as follows.


def remove_optional_multiline(infilename, outfilename, typ)

if typ == 1 #delete from

word1 = "delete"

word2 = "from"

else

word1 = "insert"

word2 = "into"

end

pat1 = %r{\b#{word1}\b[\s]*$}i

pat2 = %r{^[\s]*\b#{word2}\b}i

pat3 = %r{\b#{word1}\b[\s]+?\b#{word2}\b}i

infile = File.open infilename,'r'

outfile = File.open outfilename,'w'

while line = infile.gets

if line.match(pat1)

firstlinefound = true

joinedlines = line

next

end

if firstlinefound #for the very next line firstlinefound = false # reset it if line.match(pat2)

joinedlines = joinedlines + line joinedlines = joinedlines.sub(pat3,word1) outfile.print joinedlines joinedlines = ''

next


else #the overall match for delete from failed in this case

outfile.print joinedlines

joinedlines = ''

end

end

outfile.print line

end

infile.close

outfile.close

end

def remove_optional_singleline(infilename, outfilename, typ)

if typ == 1 #delete from

word1 = "delete"

word2 = "from"

else

word1 = "insert"

word2 = "into"

end

pat = %r{\b#{word1}\b[\s]+?\b#{word2}\b}i

infile = File.open infilename,'r'

outfile = File.open outfilename,'w'

while line = infile.gets

line.sub!(pat,word1)

outfile.print line

end

infile.close

outfile.close

end

remove_optional_multiline('inp2.sql,'tmp1.txt',1)

remove_optional_multiline('tmp1.txt','tmp2.txt',2)

remove_optional_singleline('tmp2.txt','tmp1.txt',1)

remove_optional_singleline('tmp1.txt','out.txt',2)

File.delete('tmp1.txt')

File.delete('tmp2.txt')

Two functions have been defined for single-line and multiline cases; they are being called with different arguments.

Note that the tmp1.txt and tmp2.txt files are used as intermediate files, which are deleted after the main processing is done.

Note the use of %r while creating the patterns. This is one example:

pat1 = %r{\b#{word1}\b[\s]*$}i

It is used later in the following line.

if line.match(pat1)

%r{} is equivalent to // for building patterns.

But when using it with modifiers (in this case, I) together with variable substitution (in this case word1), this is much easier than the // construct.

For the value of word1 as 'delete', the preceding is effectively /\bdelete\b[\s]*$/i.

Notice also the way two lines are joined for replacement when a successful multiline case is identified.

 

Step 3: Collect Tokens

By this stage, the input file is already in good shape. There are no comments, no empty lines (except possibly at the end), and no optional keywords. At this step, the subtask is to collect the tokens (table names) that may occur.

  1. After the update keyword: a single token
  2. After the insert keyword: a single token
  3. After the delete keyword: a single token
  4. Between the from keyword and the where keyword, or a semicolon or a closing parenthesis: single token or multiple tokens separated by commas

Of course, newline and other whitespace character separation apply.

 

Note that I did not mention looking between two keywords. For example, I did not specify to look between the update and set keywords. This is because we should be looking for complete words (keywords), and since there is no commented code, the word update can only occur at the beginning of an update statement and nowhere else.

 

So if we just take the very next word, it should be the table name that we are looking for. We need not to bother about the set keyword.

 

Searching and Replacing in Text Files

Suppose that a project has a lot of .sql files: some define tables and others define procedures. Each of these files may have code for single or multiple tables, or single or multiple procedures. All of them are in the same directory

 

When all of those files run against a database in a batch for an existing procedure, the 'create procedure' statement fails, and that creates a problem for the whole batch. 

 

If, however, each of the creates was replaced with 'create and replace' appropriately, then the procedures already created will be skipped without throwing an error and the batch will continue smoothly.

 

The task is to replace 'create' with 'create or replace' as appropriate (before the word 'procedure').

 

Solution

To start, we can work on one file that has one or more create procedure scripts. The file used as input (at least initially) has the following data.


--This is a create procedure script

--for a few procedures such as inserProc, deleteProc and modifyProc /* All the procedures

work on the tables t_abc and t_def

and are created with

create procedure statement */

create procedure insertProc

a numeric,

b numeric out

begin

some code

some more code

end;

/

--create procedure delProc

CREATE procedure deleteProc -- procedure for deletion tbl varchar2(20)

begin

some code

some more code

end;

/

create

procedure modifyProc

tbl varchar2(20)

begin

some code

some more code

end;

/

create or replace procedure tranProc

tbl varchar2(20)

begin

some code

some more code

end;

/

--end of script

Consider that in that PL/SQL language, two consecutive dashes indicate a line comment and portions within /* and */ are blocked comments (including those delimiters). And as far as keywords are concerned, the language is not case sensitive.

 

If create is replaced with creating or replace within a comment, it won’t be taken seriously. There are a few points that need to be considered for now, but we may build up as we go along.

  1. There may be more than one space between the word 'create' and the word 'procedure'.
  2. There may be newlines between them.
  3. Keywords are not case sensitive.
  4. Some create procedures may already have 'create or replace'. 

 

The first point could be easily taken care of if we look for a pattern of one or more whitespace characters (which includes tabs) between 'create' and 'procedure'. Adding to that the case insensitivity with keywords, we can try a pattern (with the i modifier) like

- /\bcreate[\s]+?procedure\b/i.

 

Note that this has been made non-greedy. The word boundaries are necessary because we do not want to pick up something like 'non-create procedures'. The words 'create' and 'procedure' have to be whole.

 

Also, it would not work across a newline. Why? Because we will be reading line by line (not the entire file in one shot). Try the following code.


infile = File.open 'crprc.txt','r'

outfile = File.open 'out.txt','w'

while line = infile.gets

line = line.gsub( /\bcreate[\s]+?procedure\b/i,'create or replace') outfile.print line

end

infile.close

outfile.close

You will find that it tackled a few cases but could not tackle a few others. 

It has replaced uppercase letters with lowercase letters (which we can live with since that would not be an issue for running SQL scripts. However, it has not worked across multiple lines (when the word procedure is in another line). It worked within comments, but that is OK for our purpose.

 

As an aside, check the previous recipe in this blog to see if you can implement a solution for this task, where changes are not made within comments.

Note that it did not adversely affect cases, where create or replace, was already there, which is good.

 

It’s time to tackle the multiline case. For this, we use a smaller data file, input2.txt, which has the following data.


create procedure abc

begin

end

create

procedure def

begin

end

For this part, we first search for lines that have 'create' (followed optionally by any number of whitespace characters until the end of the line) but not 'procedure'. Once such a line is found, we can flag it, and check if the next line has the word procedure preceded by (optionally) zero or more whitespaces.

 

If such is the case for the next line, then take the preceding part of the first line (before 'create'), the succeeding part of the second line (after 'procedure'), and join them with 'create or replace procedure' in between.

 

If the second line, however, fails to match an eligible 'procedure', unset the flag (set on encountering the first line) and move on.

 

The preceding should take care of any multiline affairs. Then another pass can be made with the original code for 'create' and 'procedure' in the same line cases. But this part is already coded, so we can concentrate on the multiline part only for now.

 

The following code works well for the (multiline) purpose. This is with the assumption that the word procedure occurs in the next line furthest from the word create.


infile = File.open 'input2.txt','r'

outfile = File.open 'out.txt','w'

while line = infile.gets

if line.match(/\bcreate[\s]*$/)

createline = true

prematch = $`

line1 = line

checklinenum = $. + 1

next #don't do any more processing for this line

end

found = false

if createline and checklinenum == $. #very next line

createline = false # reset it anyway

if line.match(/^[\s]*procedure\b/)

postmatch = $'

found = true;

end

#print for both the lines either way

if not found

outfile.print line1

outfile.print line #second line

else #found

outfile.print "#{prematch}create or replace procedure#{postmatch}"

end

next

end

#it is not the createline or the nextline outfile.print line

end

infile.close

outfile.close

It produces an out.txt file with the following text.

create procedure abc

begin

end

create or replace procedure def

begin

end

The $. (a predefined variable that holds the line number of the last line read from the current input file) has been used to ensure that the very next line is dealt with when such is applicable.

 

Combining the bits, making small changes as necessary, making things function-oriented, and also accounting for multiple .sql files, we have the following.


def replace_create(sqlfile)

infile = File.open sqlfile,'r'

outfile = File.open 'out.txt','w'

while line = infile.gets

if line.match(/\bcreate[\s]*$/)

createline = true

prematch = $`

line1 = line

checklinenum = $. + 1

next #don't do any more processing for this line

end

found = false

if createline and checklinenum == $. #very next line

createline = false # reset it anyway

if line.match(/^[\s]*procedure\b/)

postmatch = $'

found = true;

end

#print for both the lines either way

if not found

outfile.print line1

outfile.print line #second line

else #found

outfile.print "#{prematch}create or replace procedure#{postmatch}"

end

next

end

#it is not the createline or the nextline outfile.print line

end

infile.close

outfile.close

#second pass (not multiline case)

infile = File.open 'out.txt','r'

outfile = File.open sqlfile,'w'

while line = infile.gets

line = line.gsub( /\bcreate[\s]+?procedure\b/i,'create or

replace')

outfile.print line

end

infile.close

outfile.close

end

arr = Dir.glob('*.sql')

arr.each {|filename|

replace_create(filename)

}

It should be working fine.

 

Recommend