Find a table cell containing an image tag
Ian Skinner 08/27/08 05:25 P

This is probably one of those look ahead|behind things that I still do not grasp about regex.

I want to find <td...> tags in a string of HTML that contain an <img...> tag. I need the entire <td...>...<img...>...</td> string so that I can replace it, and I need to ignore all the other <td...>...</td> blocks that do not contain image tags.

TIA
Ian

Peter Boughton 08/27/08 06:47 P

> This is probably one of those look ahead|behind things that I still do
----- Excess quoted text cut - see Original Post for more -----

Can you elaborate on the context of what you're trying to do here?

Because stuff like this is generally much much simpler with either XPath or jQuery.

For example, here's how XPath matches any td containing an img:

    //td[.//img]

And here's how jQuery does it:

    $j( 'td:has(img)' )

To do it with RegEx, you want something like this:

    (?ims)<td[^>]*>(?:[^/]|/(?!td>))*<img[^>]+>.*?</td>

So yeah, if you really have a specific need to do it with RegEx, you can probably use that, but I'd generally go for either of the other two for any tag-based issue like this.

Ian Skinner 08/28/08 09:31 A

Peter Boughton wrote:
> Can you elaborate on the context of what you're trying to do here?
>
> Because stuff like this is generally much much simpler with either
> XPath or jQuery.

Yeah, XPath would be nice but the reason I'm looking for these <img...> tags, is because they are not XHTML compliant and I needed to fix that so that I could get this HTML fragment into an acceptable XML mode.

With a great deal of trial and error I finally came up with this.
    <td align="center">.*?(?=<img).*?</td>

Luckily in this HTML string, the image <td...> blocks are the only ones with the align="center" property.

Peter Boughton 08/28/08 10:01 A

> the reason I'm looking for these <img...>
> tags, is because they are not XHTML compliant and I needed to fix that
> so that I could get this HTML fragment into an acceptable XML mode.

So, why are you worrying about the tables if it's the image tags that are the issue?

Just replace <img([^>]*[^/])> with <img\1/> and not worry about the tables?

I believe there are some XPath processors that can handle non-XML HTML, but only because I've read a couple of references to such - I couldn't give any examples/recommendations.

> With a great deal of trial and error I finally came up with this.
> <td align="center">.*?(?=<img).*?</td>

Well, with a known set of source data that's fine, but potentially that could match

    <td align="center">[stuff]</td><td><img/></td>

That's why the more general solution needs the negative lookahead for /td> - to make sure that any img found is still within the opening table cell. The non-greedy matching (.*?) is not smart enough to do that for you.

Hope this all helps. :)

Ian Skinner 08/28/08 03:28 P

Peter Boughton wrote:
> Just replace <img([^>]*[^/])> with <img\1/> and not worry about the tables?

The main reason is that the fix I'm using is to remove the images, since they are irrelevant and unavailable to this conversion exercise. I could make them compliant and ignore them. But since this is more a learning exercise than anything, I wanted to see if I could match the entire <td...> block and remove it, since without the <img...> tag the cells would be empty.

----- Excess quoted text cut - see Original Post for more -----

It did, but could somebody parse the regex syntax for me in plain English? I can say I could only understand about half of it on sight.
    <td[^>]*>(?:[^/]|/(?!td>))*<img[^>]+>.*?</td>

The main features I do not grok are what role the [^>] plays and how to interpret this part in front of the negative look behind; ?:[^/]/

Peter Boughton 08/28/08 04:10 P

> they are irrelevant and unavailable to this conversion exercise. I
> could make them compliant and ignore them. But since this is more a
> learning exercise than anything...

Fair enough on both counts. :)

> The main features I do not grok are what role the [^>] plays

That's for if you don't know (or don't want to limit) what attributes a tag contains. A tag cannot contain a > character - any necessary ones would be escaped as &gt;

You could use a non-greedy wildcard like <tag.*?> but I use <tag[^>]*> as it is more precise.

> how to interpret this part in front of the negative look behind; ?:[^/]/

Ah, you're mis-reading that slightly. There are a few parts at play here, so I'll attempt to explain them individually in a simpler context...

(note, I'm adding spaces purely for readability - pretend there are no spaces in any of the following examples)

( x | y ) is the standard "x OR y" - the parentheses are necessary to prevent the OR from applying to the whole of the expression.

However, using parentheses means that regex will capture the contents for a backreference. This is not necessary here, so to tell it to discard the contents, we put ?: inside the parens, so we get (?: x | y )

p.s. this also works without the OR operator - just as (?: x )

The first part of the OR is [^/] which means simply "not /" - putting a caret (^) first inside brackets negates them. e.g. [^abc] means "a single character that is neither a, b, nor c"

Then, there's a negative lookahead which is (?! x ) and is the inverse of a regular lookahead - i.e. it makes sure the contents of the parens are NOT there. As with all lookarounds, it is zero-width - it matches only a position, not actual characters.
That is perhaps the key to understanding how they work - no characters are ever consumed by a lookaround, but it still must match against the characters that follow the current position (or, for a lookbehind, the characters that precede it).

Since we're dealing with a position, we need a preceding character to actually proceed with the match. For example, x (?! y ) will match any x that is not followed by a y (but it will match only the x, and will continue checking the rest of the pattern from the next character).

Since I mentioned the non-capturing (?: x ) above, I'll point out that this behaviour is implicit in all lookarounds - they do not capture their contents for backreferences.

So, to put all that together, what all this (?: [^/] | / (?! td> ) ) is actually saying is:

Look for anything that is not a slash OR if you do find a slash only accept it if it is not followed by the characters "td>", and when you find either of these don't bother remembering it and just move on.

Or, put more simply, "if you find /td> in this section then stop trying to match"

Hopefully all of that makes sense? Feel free to ask if any part is unclear. :)

--
Peter Boughton
//hybridchill.com
//blog.bpsite.net
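
For anyone who wants to try the patterns from this thread, here is a minimal sketch in Python. The thread never says which language or regex engine was in use, so Python's re module is an assumption, as are the sample HTML string and the helper names remove_image_cells and close_img_tags, which are made up for illustration. It only handles simple, non-nested cells like the ones discussed above.

```python
import re

# General pattern quoted in the thread. The inline flags (?ims) make it
# case-insensitive, multi-line, and let "." match newlines.
TD_WITH_IMG = re.compile(r"(?ims)<td[^>]*>(?:[^/]|/(?!td>))*<img[^>]+>.*?</td>")

# The narrower pattern from the thread, which can run past a closing </td>.
NAIVE = re.compile(r'(?ims)<td align="center">.*?(?=<img).*?</td>')

# Hypothetical sample data, invented for this illustration.
SAMPLE = (
    '<tr><td align="center">text only</td>'
    '<td><img src="a.gif"></td>'
    '<td>more text</td></tr>'
)


def remove_image_cells(html: str) -> str:
    """Delete every <td>...</td> block that contains an <img> tag."""
    return TD_WITH_IMG.sub("", html)


def close_img_tags(html: str) -> str:
    """Turn <img ...> into <img .../>, the replacement suggested in the thread."""
    return re.sub(r"<img([^>]*[^/])>", r"<img\1/>", html)


if __name__ == "__main__":
    # Only the cell that really contains the image is matched.
    print(TD_WITH_IMG.findall(SAMPLE))
    # The naive pattern starts in the text-only cell and swallows the next one.
    print(NAIVE.findall(SAMPLE))
    print(remove_image_cells(SAMPLE))
    print(close_img_tags(SAMPLE))
    # Zero-width negative lookahead: only the x itself is matched.
    print(re.findall(r"x(?!y)", "xy xz x"))
```

Comparing the first two printed lists shows the difference discussed above: the general pattern cannot step over a closing /td>, while the align="center" pattern relies on the source data never putting a text-only centered cell in front of an image cell.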