Regular Expressions (RegEx)
Find string with more than numbers between two other strings
Ian Skinner 03/07/11 01:41 P

I'm struggling with a regular expression to match this:

<SITE_LOC_ID><!--help--></SITE_LOC_ID>

Where the help content is any string that contains one or more characters that are NOT digits. I know [^0-9] would match a single non-digit character, but I don't know how to allow one or more such characters mixed in with zero or more digit characters.

I need to return the strings that match so that I can replace them. This is an XML file that will have upwards of 75,000 record nodes, each with one of these <SITE_LOC_ID>...</SITE_LOC_ID> nodes alongside several other nodes, so I want to make sure I match only the content of a single node.

TIA

P.S. The file is too large to parse into an XML data structure, so I am using simple replace() and rereplace() calls to modify the XML text file.

Barney Boisvert 03/07/11 02:41 P

Get a better parser. SAX is designed for stream processing and is exactly what you need. The DOM-centric CF XML stuff is great for simple stuff, but as you've found, it only works for small documents. I haven't checked CF9, but CF8 uses the Apache XML tooling, which includes a SAX parser. I'd expect CF9 to be the same.

If you can't/won't and must pursue the regex approach, you'll again need a better RegEx engine than the one CF ships with, specifically one that supports lookahead and lookbehind to anchor yourself. Again, you already have what you need: the java.util.regex package provides all this functionality (CF uses ORO, which doesn't have it, instead of the Java-native stuff).

The gist is this (which will work from CFML as-is):

newXmlString = xmlString.replaceAll(regex, replaceString);

cheers,
barneyb

Ian Skinner 03/07/11 05:02 P

On 3/7/2011 11:40 AM, Barney Boisvert wrote:

Well, I'm not really looking for a new XML parser at this time, as ColdFusion is not expected to parse the file. I am only trying to clean up an example demonstration file that can then be used for other testing purposes.

> Specifically one that supports lookahead and lookbehind to anchor
> yourself.

Ok, but I cannot see how to use lookahead (which does exist without going into the Java) or lookbehind to do what I need. As best as I can tell, those would be great for getting some of the numbers if there were non-digit characters in the string. But I do not see how to match the ENTIRE string IF one or more of the characters in the string is a non-digit character.

I.E.
<tag>19984798</tag> NOT a match.
<tag>18435A89</tag> IS a match, return 18435A89.
<tag>Z8457920</tag> IS a match, return Z8457920.
<tag>7493841-</tag> IS a match, return 7493841-
ETC.

Barney Boisvert 03/07/11 05:14 P

Actually, now that I think about it a little more....

Is it safe to assume that you have well-formed XML (even though you don't want to parse it), that the <tag> element has no child elements, and finally that there are no comments or CDATA blocks in the file? If so, you can look for

<tag>([^<]*[^0-9][^<]*)</tag>

and that should get you what you want. The first [^<]* eats any leading characters, [^0-9] requires at least one non-digit, and the trailing [^<]* eats the rest of the node's text, so the capture group holds the entire content of any node that is not purely digits.

I still think a better solution would be to SAX it through a pipeline and modify it stream-wise, because any time you're manipulating XML in a non-XML-aware fashion you're just asking for pain. But if my stated assumptions are valid and you're confident that they will always remain so, that regex should work. If those assumptions aren't valid, or you don't feel comfortable relying on them, you're going to need an XML-aware mechanism to process the XML.

cheers,
barneyb
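
For anyone who wants to try the pattern from the last reply outside of CFML, here is a minimal sketch in plain Java using java.util.regex. It assumes the same things Barney lists (well-formed XML, no child elements inside the node, no comments or CDATA). The class name SiteLocIdCleaner, the two helper methods, and the sample data are made up for illustration; only the pattern itself comes from the thread, with <tag> swapped for <SITE_LOC_ID>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of CF or any library.
class SiteLocIdCleaner {
    // Pattern from the thread, with <tag> replaced by <SITE_LOC_ID>.
    // Group 1 captures the whole node content when it contains
    // at least one non-digit character.
    private static final Pattern NON_NUMERIC_ID =
        Pattern.compile("<SITE_LOC_ID>([^<]*[^0-9][^<]*)</SITE_LOC_ID>");

    // Collect the content of every node that is not purely digits.
    static List<String> findNonNumericIds(String xml) {
        List<String> found = new ArrayList<>();
        Matcher m = NON_NUMERIC_ID.matcher(xml);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // Replace the content of every matching node with a fixed value,
    // keeping the surrounding tags intact.
    static String replaceNonNumericIds(String xml, String replacement) {
        String node = "<SITE_LOC_ID>"
            + Matcher.quoteReplacement(replacement)
            + "</SITE_LOC_ID>";
        return NON_NUMERIC_ID.matcher(xml).replaceAll(node);
    }

    public static void main(String[] args) {
        // Sample values taken from the examples in the thread.
        String xml =
              "<record><SITE_LOC_ID>19984798</SITE_LOC_ID></record>\n"
            + "<record><SITE_LOC_ID>18435A89</SITE_LOC_ID></record>\n"
            + "<record><SITE_LOC_ID>Z8457920</SITE_LOC_ID></record>\n"
            + "<record><SITE_LOC_ID>7493841-</SITE_LOC_ID></record>\n";

        System.out.println(findNonNumericIds(xml));
        System.out.println(replaceNonNumericIds(xml, "0"));
    }
}
```

From CFML the same pattern string could presumably be passed straight to xmlString.replaceAll(regex, replaceString) as shown earlier in the thread. Matcher.quoteReplacement is used here only so that a replacement value containing $ or \ is treated literally.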
May 25, 2013