Regexp HTML Parsing

Post by RedBassett on May 1, 2012 21:53:41 GMT -8

I have some HTML tags I want to match with a regexp, but I am having trouble matching the characters between the tags. Currently I am trying to match all characters that are not part of the closing tag string. Example below:

<a href="something"><h1>I am a tag!</h1></a>

If I want to get all the content of the "h1" element, I would use "[^<]*". This doesn't work for the "a" element, however, since the opening tag of the "h1" would cause the matching to stop. How do I instruct it to match everything up to "</a"?
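To make the problem concrete, here is a small sketch of the behaviour described above. The variable names and the exact patterns are assumptions for illustration, not the code from the original post.

```javascript
const sample = '<a href="something"><h1>I am a tag!</h1></a>';

// [^<]* works for the h1 because its content contains no "<".
const h1Content = sample.match(/<h1>([^<]*)<\/h1>/)[1];

// The same idea fails for the a element: [^<]* stops right away at "<h1>",
// the pattern then needs "</a>" at that spot, and the whole match fails.
const aAttempt = sample.match(/<a[^>]*>([^<]*)<\/a>/);

console.log(h1Content); // I am a tag!
console.log(aAttempt);  // null
```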

Last Edit: May 1, 2012 21:54:12 GMT -8 by RedBassett

Post by Chris on May 2, 2012 6:56:04 GMT -8

Typically you use backreferences. I'm a tad rusty, but try something like this:

<(\w+).*?>(.*?)<\/\\1>

It might also be $1 instead of \\1 in JS. Can't recall.
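A rough sketch of the backreference approach in JavaScript follows. In a regex literal the backreference is written \1. The doubled \\1 is what you type when the pattern is built from a string, and $1 belongs in replacement strings rather than in the pattern. Using [^>]* for the attributes and [\s\S]*? for the content is an assumption of this sketch, as are the variable names.

```javascript
// \1 must repeat whatever group 1 captured, i.e. the tag name.
const tagPattern = /<(\w+)[^>]*>([\s\S]*?)<\/\1>/;

const html = '<a href="something"><h1>I am a tag!</h1></a>';
const match = html.match(tagPattern);

const tagName = match[1]; // name of the outer tag
const inner = match[2];   // everything up to the matching closing tag

// The same pattern built from a string needs every backslash doubled.
const fromString = new RegExp('<(\\w+)[^>]*>([\\s\\S]*?)<\\/\\1>');
const sameInner = html.match(fromString)[2];

// $1 refers to group 1 inside a replacement string.
const renamed = html.replace(/<(\w+)[^>]*>[\s\S]*<\/\1>/, '<$1>replaced</$1>');

console.log(tagName); // a
console.log(inner);   // <h1>I am a tag!</h1>
console.log(renamed); // <a>replaced</a>
```

This only holds up for simple, well-formed snippets. An element nested inside another element with the same tag name would end the match at the inner closing tag.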

Post by RedBassett on May 2, 2012 11:12:55 GMT -8

May 2, 2012 6:56:04 GMT -8 Chris said:

Typically you use backreferences. I'm a tad rusty, but try something like this:

<(\w+).*?>(.*?)<\/\\1>

It might also be $1 instead of \\1 in JS. Can't recall.

So it turns out (.*?) was the trick. I think I had lazy and greedy mixed up: what I was using there matched the closing tag as if it were child content. Just changing that part fixed it without needing backreferences.

Thanks!
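For anyone landing here later, the greedy versus lazy difference can be sketched like this. The input with two links is a made-up example, chosen because the difference only shows up when more than one closing tag is present.

```javascript
const twoLinks = '<a href="x"><h1>One</h1></a><a href="y"><h1>Two</h1></a>';

// Greedy: .* runs to the end of the input, then backs up to the LAST </a>,
// so the first closing tag is swallowed as if it were child content.
const greedy = twoLinks.match(/<a[^>]*>(.*)<\/a>/)[1];

// Lazy: .*? grows one character at a time and stops at the FIRST </a>.
const lazy = twoLinks.match(/<a[^>]*>(.*?)<\/a>/)[1];

console.log(greedy); // <h1>One</h1></a><a href="y"><h1>Two</h1>
console.log(lazy);   // <h1>One</h1>
```

Note that . does not match line breaks by default in JavaScript, so content spanning several lines would need [\s\S] or the s flag instead.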
