Sometimes you might need to use regex to parse HTML: maybe to make changes to that HTML, or to check whether it contains (or doesn't contain) certain things.
I first wrote about using regex with HTML over six years ago, when I was scraping HTML responses to pull out specific data.
Regex (or Regular Expressions, to give them their full name) can help you match all kinds of patterns in strings. And while HTML isn't a string, it's often provided as a string (or can be turned into one). For example, if you are sending HTML or storing it, it will often be as a string.
So when I needed a way to get HTML tags from a string of HTML, I thought regex might be able to provide the answers.
And it did! 👏
What I set out to achieve with this blog post is to use regex to get the tag name and a list of all the attributes from each HTML start tag.
If you just want a list of HTML attributes, check out how to get HTML attributes using regex instead!
OK, before we get into the nitty-gritty, here’s the code I ended up with:
/<(?<tag>[^/]\w+?)(?: (?<attrs>.*?))?( ?\/)?>/gm
Erm, ok, as usual, Regular Expressions are super easy to read and understand!
Let's see what's going on here (and how to use it!).
Get the tag
The start tag (<h1>, <div>, <p>, <table> etc.) of an HTML element is where you find attributes. Since there are lots of different HTML elements, we'll need to know what kind of element we matched, especially if we need to, for example, remove a certain attribute but only if it's on a particular element - like a paragraph or a table cell.
We can match all the HTML elements like this:
/<.+?>/gm
This regex will match every start tag (<h1 class="title">), but also every end tag (</h1>). We only want to match the start tags (since that's where the attributes live), so we can update our regex to this:
/<[^/].+?>/gm
Now we are only matching the start tags (with attributes).
const html = '<div><h1 class="title"></h1><form><label><input name="name" id="name" placeholder="Your Name" /></label></form></div>';
const matches = html.match(/<[^/].+?>/gm);
console.log(matches);
// ["<div>", '<h1 class="title">', "<form>", "<label>", '<input name="name" id="name" placeholder="Your Name" />']
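To see what the [^/] buys us, here's a quick side-by-side sketch of the two patterns on a smaller string. The sample variable and its contents are made up just for this comparison.

```javascript
// Hypothetical sample string, just for this comparison.
const sample = '<h1 class="title">Hello</h1>';

// Matches anything between < and >, so the end tag is included.
const allTags = sample.match(/<.+?>/gm);

// [^/] requires the first character after < to be anything but a slash,
// which rules out end tags like </h1>.
const startTags = sample.match(/<[^/].+?>/gm);

console.log(allTags);   // ['<h1 class="title">', '</h1>']
console.log(startTags); // ['<h1 class="title">']
```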
Let's add a named regex group for our tags ((?<tag>...)) so we can separate the element from the attributes.
/<(?<tag>[^/].+?)>/gm
But this still matches everything in the start tag, including any HTML attributes. We want our tag group to just be the tag. So let's change that . (which matches any character) to \w, which matches word characters only.
/<(?<tag>[^/]\w+?)>/gm
That will break our regex for tags with attributes, so let's add another group for the HTML attributes, so we can keep our tag name separate.
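A quick check of that last pattern shows the breakage. This is only a sketch using the same example string as before: the tags without attributes still match, while anything with attributes drops out, because \w can't get past the space after the tag name.

```javascript
const html = '<div><h1 class="title"></h1><form><label><input name="name" id="name" placeholder="Your Name" /></label></form></div>';

// Collect the tag group of every match.
const tags = [];
for (const match of html.matchAll(/<(?<tag>[^/]\w+?)>/gm)) {
  tags.push(match.groups.tag);
}

console.log(tags);
// ["div", "form", "label"] - the h1 and input tags are missing
```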
Separate the tag from attributes
To get the HTML attributes, we'll need to add a second regex group.
A start tag might be <div class="container">, so we have the tag name, a space, and then the attributes. So we can update our regex to match a space, and then anything else as few times as possible (( .*?)).
We don't want to capture the space, so let's wrap that in a non-capturing group ((?: )) and capture only what follows it - so it now looks like this: (?: (.*?)).
But since start tags might not contain attributes, we need to make this second group optional ((...)?): (?: (.*?))?
Here's how the Regular Expression looks so far:
/<(?<tag>[^/]\w+?)(?: (.*?))?>/gm
Our regex will match the full tag (including its attributes), just like before:
const html = '<div><h1 class="title"></h1><form><label><input name="name" id="name" placeholder="Your Name" /></label></form></div>';
const matches = html.matchAll(/<(?<tag>[^/]\w+?)(?: (.*?))?>/gm);
for (const match of matches) {
  console.log(match[0]);
}
// "<div>"
// '<h1 class="title">'
// "<form>"
// "<label>"
// '<input name="name" id="name" placeholder="Your Name" />'
But we now have a separate tag group, so we know what type of element we've matched.
const html = '<div><h1 class="title"></h1><form><label><input name="name" id="name" placeholder="Your Name" /></label></form></div>';
const matches = html.matchAll(/<(?<tag>[^/]\w+?)(?: (.*?))?>/gm);
for (const match of matches) {
  console.log(match.groups.tag);
}
// "div"
// "h1"
// "form"
// "label"
// "input"
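For comparison, here's a small sketch of reading the same match by name and by position. The attribute group has no name at this point, so it's only reachable as match[2]. The shortened sample string is just for illustration.

```javascript
// Shortened, hypothetical sample string.
const html = '<div><h1 class="title"></h1></div>';

const rows = [];
for (const match of html.matchAll(/<(?<tag>[^/]\w+?)(?: (.*?))?>/gm)) {
  // match.groups.tag and match[1] are the same capture;
  // match[2] is the unnamed attributes group.
  rows.push([match.groups.tag, match[1], match[2]]);
}

console.log(rows);
// [["div", "div", undefined], ["h1", "h1", 'class="title"']]
```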
Naming regex groups seems useful, so let's also name the attribute group attrs.
/<(?<tag>[^/]\w+?)(?: (?<attrs>.*?))?>/gm
Nice!
const html = '<div><h1 class="title"></h1><form><label><input name="name" id="name" placeholder="Your Name" /></label></form></div>';
const matches = html.matchAll(/<(?<tag>[^/]\w+?)(?: (?<attrs>.*?))?>/gm);
for (const match of matches) {
  console.log(match.groups);
}
// { tag: "div", attrs: undefined }
// { tag: "h1", attrs: 'class="title"' }
// { tag: "form", attrs: undefined }
// { tag: "label", attrs: undefined }
// { tag: "input", attrs: 'name="name" id="name" placeholder="Your Name" /' }
And that kinda works. The start tag is fully matched, but we also have our named groups: the tag group and the attrs group.
One little detail I'm unhappy with - the input tag has a trailing (self-closing) slash, and I don't want that in the attributes group. So I've added another optional group (( ?\/)?) to check for a trailing slash - with or without a space before it - which updates the regex to:
/<(?<tag>[^/]\w+?)(?: (?<attrs>.*?))?( ?\/)?>/gm
That's better!
const html = '<div><h1 class="title"></h1><form><label><input name="name" id="name" placeholder="Your Name" /></label></form></div>';
const matches = html.matchAll(/<(?<tag>[^/]\w+?)(?: (?<attrs>.*?))?( ?\/)?>/gm);
for (const match of matches) {
  console.log(match[0]);
  console.log(match.groups);
}
// "<div>"
// Object { tag: "div", attrs: undefined }
// '<h1 class="title">'
// Object { tag: "h1", attrs: 'class="title"' }
// "<form>"
// Object { tag: "form", attrs: undefined }
// "<label>"
// Object { tag: "label", attrs: undefined }
// '<input name="name" id="name" placeholder="Your Name" />'
// Object { tag: "input", attrs: 'name="name" id="name" placeholder="Your Name"' }
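The "with or without a space" part is easy to check with a few self-closing tags. This is a minimal sketch on a made-up string. One detail it surfaces: when a tag has a space before the slash but no attributes (like <hr />), the attrs group comes back as an empty string rather than undefined.

```javascript
// Hypothetical sample: self-closing tags with and without a space before the slash.
const html = '<br/><hr /><img src="/some/url"/>';
const pattern = /<(?<tag>[^/]\w+?)(?: (?<attrs>.*?))?( ?\/)?>/gm;

const results = [...html.matchAll(pattern)].map((m) => [m.groups.tag, m.groups.attrs]);

console.log(results);
// [["br", undefined], ["hr", ""], ["img", 'src="/some/url"']]
```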
We have two named groups: the tag, which is your HTML element (div, p, span, table etc.), and the attrs, which are your HTML attributes. And now that we’ve used regex to match them, we can do something with them!
After all, you’re probably not using a Regular Expression to get HTML tags for fun.
But it is fun... isn’t it?!
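One thing seems worth checking before putting the pattern to work. As written, [^/]\w+? needs at least two characters (one for [^/], then one or more for \w+?), so single-letter elements like p or a don't match. The sketch below shows that on a made-up string, alongside one possible tweak: swapping \w+? for \w*?. That variant is an assumption for illustration, not the pattern built up above.

```javascript
// Hypothetical sample with single-letter elements (p, a) and a longer one (h1).
const html = '<p class="intro">Read <a href="/more">more</a></p><h1>Title</h1>';

// The pattern from this post.
const original = /<(?<tag>[^/]\w+?)(?: (?<attrs>.*?))?( ?\/)?>/gm;
// Assumed variant: \w*? lets the tag name be a single character.
const variant = /<(?<tag>[^/]\w*?)(?: (?<attrs>.*?))?( ?\/)?>/gm;

// Helper (hypothetical name) that lists the tag group of every match.
const tagsFor = (pattern) => [...html.matchAll(pattern)].map((m) => m.groups.tag);

console.log(tagsFor(original)); // ["h1"]
console.log(tagsFor(variant));  // ["p", "a", "h1"]
```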
But maybe you need to find an HTML tag for some reason. Let's say you just need to get the image tags out for validation or checking. You can do that by checking the tag group for img.
const html = '<div><h1 class="title"></h1><img class="big-image" src="/some/url" /><form><label><input name="name" id="name" placeholder="Your Name" /></label></form></div>';
const matches = html.matchAll(/<(?<tag>[^/]\w+?)(?: (?<attrs>.*?))?( ?\/)?>/gm);
for (const match of matches) {
  if (match.groups.tag === "img") {
    console.log(match[0]);
  }
}
// '<img class="big-image" src="/some/url" />'
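The other use case mentioned earlier, removing an attribute only when it's on a particular element, could be sketched along the same lines. The removeAttribute helper, its parameters, and the sample table markup are all hypothetical, and the sketch assumes double-quoted attribute values and plain attribute names with no special regex characters.

```javascript
const startTag = /<(?<tag>[^/]\w+?)(?: (?<attrs>.*?))?( ?\/)?>/gm;

// Hypothetical helper: strips attrName from every <tagName ...> start tag.
// Assumes double-quoted values, e.g. width="50".
function removeAttribute(html, tagName, attrName) {
  // Require whitespace before the name so data-width is not mistaken for width.
  const attrPattern = new RegExp(`\\s+${attrName}="[^"]*"`, "g");
  return html.replace(startTag, (...args) => {
    const fullTag = args[0];
    // With named groups, the groups object is the last callback argument.
    const groups = args[args.length - 1];
    if (groups.tag !== tagName || !groups.attrs) return fullTag;
    return fullTag.replace(attrPattern, "");
  });
}

const table = '<table><tr><td class="cell" width="50">A</td><th width="50">B</th></tr></table>';
console.log(removeAttribute(table, "td", "width"));
// '<table><tr><td class="cell">A</td><th width="50">B</th></tr></table>'
```

Only the td loses its width here; the th keeps it because its tag group doesn't match.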
If you want to match attributes and don't care what tag they're on, check out my blog post on how to get HTML attributes using regex.