Tuesday, December 9, 2008

Remove commas when inside quotes

Remove commas when inside quotes

adcm (MIS)
21 Oct 98 12:27
I need to remove commas which are contained in between quotes. The following
comma seperated file

"ADSCB","","","Sean C. Berry","Shyster, Shylock, Shlomb & Shyew","New York,
New York","","","","","267-408-7883","267-408-1566","233-478-2000","233 478
2466",26,6,0,1,0,1,"10/19/1998","08:19"

Should look like this:

"ADSCB","","","Sean C. Berry","Shyster Shylock Shlomb & Shyew","New York New
York","","","","","267-408-7883","267-408-1566","233-478-2000","233 478
2466",26,6,0,1,0,1,"10/19/1998","08:19"

I know I should be using sed, but I cant seem to get it going??


Carlo Mauro
cmauro@wlrk.com
WLRK
bigscouse (MIS)
22 Oct 98 7:26
Try the following, it will work in korn and c shell. Create a file, say comma_remove.csh for instance, and add this line of text: -
cat temp1 ¦ sed -e "s/,/./g" > temp2 - where temp1 is the input filename and temp2 the output filename. Note don't forget to chmod the file to 777 or 111 etc so it is executable. Issue the following command from the command line ./comma_remove.csh and the commas will be replaced by full stops. It does work as I have just tried it out. If you want to replace the commas with something else then overwrite the /./ with whatever you want, if it's spaces you want then use / / or for nothing use //.
adcm (MIS)
22 Oct 98 17:24
I dont wish to remove all commas. Just the ones which fall in between quotes.

i.e. "New York, NY" should be interpreted as only one field.

The real hard part is the fact that not all fileds are enclosed in quotes!
Yacov (MIS)
7 Dec 98 8:57
sed 's/\"\(.*\),\(.*\)\"/"\1 \2"/' /tmp/in_file>/tmp/out_file

The above will work only if there is 1 , within the quotes
Yacov (MIS)
7 Dec 98 8:58
sed 's/\"\(.*\),\(.*\)\"/"\1 \2"/' /tmp/in_file>/tmp/out_file

The above will work only if there is 1 , within the quotes, you can
add as many , as you like by modifying the above line
derekludwig (Programmer)
25 Jun 99 16:28
Here is a solution for 'sed'. You will need
to create a sed-script file, "clean.sed", which contains:

#n
# add a trailing , to make parsing easier
/[^,]$/s/$/,/

# unless /..."aaa",/ convert /...aaa,/ to /...aaa#/
# repeat until done

:z
s/^([^",]*),/\1#/
tz

# convert /..."aaa,bbb",/ to /..."aaabbb",/
# repeat until done

:a
s/^([^",]*)("[^",]*),([^"]*"),/\1\2\3,/
ta

# convert /..."aaa",/ to /...@aaa@#/

s/^([^",]*)"([^"]*)",/\1@\2@#/

# repeat until no more /,/ or /"/ remain
/[,"]/tz

# get rid of trailing # [ was added at the start ]
s/.$//

# convert @ back to ", and # back to ,
s/@/"/g
s/#/,/g

p

You will have to invoke sed as follows:

sed -E -f clean.sed sourcefile > destfile

I do wonder about the removal of commas within a quoted field. A (name) field with a value like "Mouse, Mickey" would be converted to "Mouse Mickey", which might not be what was intended. If the purpose of removing the commas within quoted fields is to allow one to parse the file, I would recommend you look at the Perl Cookbook for how to parse "Comma Separated (CSV) Files". It shows how to split a line into fields on the 'correct' commas.

-- Derek
derekludwig (Programmer)
26 Jun 99 13:32
Forgot to including in my previous reply...

The sed script is converting , to # and " to @, from left to right. # and @ need to be some character that does not appear elsewhere in the file. If I were using this for 'real data', I would probably use <CTRL-C> and <CTRL-Q> instead of # and @.

-- D
flogrr (Programmer)
4 May 00 22:41
Here is a short nawk/gawk program that will do what
you want (remove all commas from inside double quotes):

Usage: thisfile input output <CR>

Where: thisfile = this program saved in a file and made
executeable ( chmod +x thisfile )

input = input file output = output file


nawk 'BEGIN{FS=""}

{
gsub(/,\ /,"\ ")
print

}' $1 > $2

flogrr

1 comment:

Magda said...

Hi. Thank you for these. Ive neen searching for something all day. The sed program mainly works for me but the output is all doubles (lines are written twice). i took out the p at the end because i was getting an error. Any ideas? Thanks