Jan 15, 2013; 17:31
Tim Taplin
trying to work out some regexp issues in lasso9
So, I've always steered clear of regex, but I know there's a place for it. I'm working on updating the very useful csv tag on tagswap to be lasso9 friendly and am running into some regex-specific issues.
The original tag uses a regex to parse each line of the csv file on loading. The regex has a glitch: it creates an array entry for each comma, and that entry actually contains the comma. The next processing step handles this nicely in lasso8, but in lasso9 the same query creates an array entry which is null.
Since it is possible that there could be both empty and null values in a csv file, I can't just look for empty array values and delete them. I've been trying to figure out why the regex creates this extra entry, and whether there is some reason for the change in behavior. I have also tried to resolve the issue at the regex level. All with no success and much hair pulling.
Here is the basic code and the regex as it stands, lightly ported to lasso9 syntax, where the local #line contains the line string.
local(field = string())
local(i = null)
local(row = array())
local(linesplit = string_findregexp( #line, -find = '"(?:[^"]|"")*"|[^,]*|,'))
iterate( #linesplit, #i )
    if(#i == ',')
        #row->insertlast(#field)
        #field = ''
    else(#i->beginswith('"') && #i->endswith('"'))
        #field += #i->substring(2, #i->size - 2)->replace('""', '"')&;
    else
        #field += #i
    /if
/iterate
I think that the issue may relate in some way to the note in the documentation regarding grouping:
If groups are defined in the -Find expression then the output contains the entire search result followed by each of the sub-groups. If there were 2 matches of the expressions and 2 sub-groups then the array contains a total of 6 items.
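To make the arithmetic in that note concrete, here is a minimal sketch in Python (not Lasso) that mimics the documented layout: each full match followed by its sub-groups, all in one flat list. The helper name flatten_matches and the sample patterns are made up for illustration, and Python's re module is only a stand-in for whatever engine lasso9 uses. Under this reading, a pattern whose only parentheses are of the non-capturing (?:...) kind, like the one in the csv tag, would contribute no sub-group entries.

```python
import re

def flatten_matches(pattern, text):
    """Return every full match followed by its capture groups, as one flat list."""
    out = []
    for m in re.finditer(pattern, text):
        out.append(m.group(0))   # the entire search result
        out.extend(m.groups())   # then each capturing sub-group
    return out

# Two matches, two capturing groups: 2 * (1 + 2) = 6 items.
two_groups = flatten_matches(r'(\w+)=(\d+)', 'a=1,b=2')

# Same pattern with non-capturing groups: only the 2 full matches remain.
no_groups = flatten_matches(r'(?:\w+)=(?:\d+)', 'a=1,b=2')
```
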
As I read the regex it is searching for
(a leading doublequote, followed by zero or more repetitions of either a non-doublequote character or two doublequotes, up to the closing doublequote)
or
(any number of non-comma characters, or a comma)
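The reading above can be checked piece by piece. The sketch below is Python, not Lasso, so it says nothing about how lasso9's engine treats the pattern; it only shows that the pattern has three alternatives tried left to right, and that the middle one, [^,]*, also accepts an empty string. The variant with [^,]+ and the names TOKEN and split_line are assumptions made for the illustration, not part of the original tag.

```python
import re

# The three alternatives of the pattern, tried left to right.
QUOTED = r'"(?:[^"]|"")*"'   # quoted field; "" inside stands for a literal quote
UNQUOTED = r'[^,]*'          # zero or more non-comma characters
COMMA = r','                 # a single separator

# The second alternative also accepts the empty string.
matches_empty = re.fullmatch(UNQUOTED, '') is not None

# Hypothetical variant: "+" instead of "*", so no alternative can match nothing.
TOKEN = re.compile(QUOTED + r'|[^,]+|' + COMMA)

def split_line(line):
    """Rebuild the fields of one CSV line from its tokens, like the loop in the post."""
    row, field = [], ''
    for m in TOKEN.finditer(line):
        tok = m.group(0)
        if tok == ',':
            row.append(field)        # a comma closes the current field
            field = ''
        elif len(tok) >= 2 and tok.startswith('"') and tok.endswith('"'):
            field += tok[1:-1].replace('""', '"')   # strip quotes, unescape ""
        else:
            field += tok
    row.append(field)                # the last field has no trailing comma
    return row

fields = split_line('a,,"b ""c"", d",42')
```

With this variant an empty field produces no token of its own; it shows up as two commas in a row, and the loop still appends an empty string for it.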
I think that my issue is that I'm getting the comma match and the grouped match for the quotes, but I can't figure out a way to remove that element without breaking the parsing of unquoted numeric, null, or empty values.
I thought I remembered that there were some differences in regex behavior in lasso9, but I can't find those discussions or any documentation referring to the differences.
Any help would be appreciated.
Tim Taplin
#############################################################
This message is sent to you because you are subscribed to
the mailing list Lasso
Lasso@lists.lassosoft.com
To unsubscribe, E-mail to: <Lasso-unsubscribe@lists.lassosoft.com>
Send administrative queries to <Lasso-request@lists.lassosoft.com>
Jan 15, 2013; 20:29
Brad Lindsay
Re: trying to work out some regexp issues in lasso9