I've recently been working with a WYSIWYG editor that outputs HTML, but it also adds certain HTML attributes that I don't need to store because they are the default values. For example, I wanted to remove colspan="1" and rowspan="1" from table cells.
I initially used a Regular Expression to get these specific attributes, and it looked like this:
html.replace(/ (?:colspan|rowspan)="1"/g, "");
But then a few more attributes popped up, and I had to start chaining multiple .replace methods together, so I figured I could make something a little more reusable.
What I wanted to do was match every HTML attribute in the HTML output string. It turned out to be a little complex, but nothing that regex can't handle!
Getting all attributes
I started with this basic version of the regex:
/\w+=".+?"/gm
Which works. Kinda. \w matches any word character, and + repeats it between one and unlimited times. So it will match any word up to the = sign (our attribute name). It matches attributes like class="some-class", but for attributes with hyphens like data-columns="3" it only picks up the part after the last hyphen (columns="3").
So I switched it up to
/[\w-]+=".+?"/gm
[\w-] matches word characters or dashes (inside square brackets the options are simply listed, so no | is needed).
Then we have ".+?" which will lazily match whatever is between "..." - the attribute value.
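To see the difference between the variants so far, here is a small sketch run against a made-up table cell (the cell string and the variable names are only for illustration):

```javascript
// Illustrative input, not taken from the editor output.
const cell = '<td class="some-class" data-columns="3"></td>';

// \w stops at the hyphen, so the second name loses its "data-" prefix.
const wordOnly = cell.match(/\w+=".+?"/g);

// [\w-] lets the attribute name contain hyphens.
const withHyphen = cell.match(/[\w-]+=".+?"/g);

// Without the ?, .+ is greedy and runs on to the last quote in the string,
// so both attributes end up in one match.
const greedy = cell.match(/[\w-]+=".+"/g);

console.log(wordOnly);   // second entry is columns="3"
console.log(withHyphen); // second entry is data-columns="3"
console.log(greedy);     // a single entry spanning both attributes
```

The greedy variant is the reason the value part is written as .+? rather than .+.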
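This works fine for my HTML output since the attributes will always use " but if yours could also be ' then you might need ["'] in place of each quote. There's no | inside the brackets, because in a character class it would be matched as a literal character.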
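One way to handle both quote styles, as a rough sketch on a made-up string, is to capture the opening quote and use a backreference (\1) so the value has to close with the same character. The input and the variable names here are assumptions for illustration only:

```javascript
// Illustrative input mixing single and double quotes.
const mixedQuotes = `<a href='/home' class="nav-link">Home</a>`;

// (["']) captures whichever quote opens the value; \1 requires the same one to close it.
const quoted = mixedQuotes.match(/[\w-]+=(["']).+?\1/g);

console.log(quoted);
// matches: href='/home' and class="nav-link"
```

Because the g flag is set, match() returns the full matches rather than the captured quote character.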
Removing fake attributes
What if someone puts a fake attribute in your HTML as text? For example <p>test="false"</p>. That's not an attribute, because it's not inside an HTML tag.
const html = '<div>fake="true"<h1 class="title-class"></h1><form><label data-columns="3"><input name="name" id="name" placeholder="Your Name" disabled /></label></form></div>';
const matches = html.match(/[\w-]+=".+?"/gm);
console.log(matches);
// ['fake="true"', 'class="title-class"', 'data-columns="3"', 'name="name"', 'id="name"', 'placeholder="Your Name"']
So how can we solve this?
Well, luckily for us, regex has a pretty cool feature - Lookahead and Lookbehind.
Let's start by looking behind ((?<=...)).
Behind our attribute, there should be an HTML start tag, which will begin with <. But it will also contain the tag name, e.g. div, h1, p, table etc. I'll use <.+? to match < and then any character between one and unlimited times (the tag name, plus any earlier attributes), followed by a space before our attribute match, because the tag name will have a space before any attributes.
The Lookbehind will look like this: (?<=<.+? ).
/(?<=<.+? )[\w-]+=".+?"/gm
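In this example the fake attribute has actually dropped out already, because the Lookbehind requires a space directly before the match and fake="true" comes straight after >. But put a space in front of it, as in <p>a test="false"</p>, and it would still match: there is an opening < somewhere before it, even though that start tag has already been closed with >.
So I'm also going to add a negative Lookbehind ((?<!...)) as a guard against a closing > sitting directly before the match. It only inspects the single character before the match position, so it can't rule out a > further back. While the space is still part of the positive Lookbehind that character is always the space, so the guard only starts to bite once the space moves into the match itself, which happens in the replacing section.
The negative Lookbehind will look like this: (?<!>)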
Here's how the Regular Expression looks now:
/(?<=<.+? )(?<!>)[\w-]+=".+?"/gm
Now our fake attribute isn't matched:
const html = '<div>fake="true"<h1 class="title-class"></h1><form><label data-columns="3"><input name="name" id="name" placeholder="Your Name" disabled /></label></form></div>';
const matches = html.match(/(?<=<.+? )(?<!>)[\w-]+=".+?"/gm);
console.log(matches);
// ['class="title-class"', 'data-columns="3"', 'name="name"', 'id="name"', 'placeholder="Your Name"']
For completeness, and to make double sure that our attributes are within a valid HTML start tag, let's also add a Lookahead ((?=...)).
We need to check that after our match, the start tag closes (>). But there might also be other attributes and things, so we will first match any character between zero and unlimited times (as few as possible) with .*?.
The Lookahead will look like this: (?=.*?>).
And just like we did at the start, let's make sure another HTML start tag hasn't opened up first. We can add a negative Lookahead ((?!...)) before our positive Lookahead. It only checks the single character directly after the match.
The negative Lookahead will look like this: (?!<)
Here's how the Regular Expression looks now:
/(?<=<.+? )(?<!>)[\w-]+=".+?"(?!<)(?=.*?>)/gm
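To show what the Lookahead adds, here is a small sketch on a made-up string where the second tag is never closed (the input and variable names are assumptions for illustration):

```javascript
// Illustrative input: the <em tag is missing its closing >.
const broken = '<h1 class="title">Hi</h1> <em id="x"';

// Lookbehinds only: both attribute-like strings are accepted.
const withoutLookahead = broken.match(/(?<=<.+? )(?<!>)[\w-]+=".+?"/gm);

// With the Lookahead, a > must follow somewhere after the match.
const withLookahead = broken.match(/(?<=<.+? )(?<!>)[\w-]+=".+?"(?!<)(?=.*?>)/gm);

console.log(withoutLookahead); // class="title" and id="x"
console.log(withLookahead);    // only class="title"
```

The id="x" match is rejected because nothing after it closes the tag.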
Now we are ready to tackle a new challenge - boolean attributes.
Getting boolean attributes
The actual match [\w-]+=".+?" works great for your standard attributes. Your class="alert alert-red" or whatever. And I was pretty much done until I remembered those sneaky Boolean attributes - yeah disabled - I’m looking at you!
<input name="name" id="name" placeholder="Your Name" disabled />
My regex won’t pick up that disabled attribute.
And that's because the Regular Expression is looking for attribute="value" but it might just be attribute.
So let’s fix it.
The attribute could end with =".+?" or it might end with a space. But I don't want to match the space. In regex, a word boundary \b matches the position between a word character and a non-word character (or the end of the string) without consuming any characters.
We can wrap these two options in a regex group ((...)) and say =".+?" OR \b. Or in regex is |, so our update will change =".+?" to (=".+?"|\b).
We don't want to capture this group (yet! - more on this later), so let's make our group a non-capturing group - which looks like this (?:…).
[\w-]+(?:=".+?"|\b)
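The order of the two options matters, because the regex engine tries them left to right. A quick sketch on a made-up attribute string (names are illustrative):

```javascript
// Illustrative input: one regular attribute and one boolean attribute.
const attrs = 'name="name" disabled';

// Value first: the ="..." part is tried before falling back to \b.
const valueFirst = attrs.match(/[\w-]+(?:=".+?"|\b)/g);

// Boundary first: \b succeeds right after the name, so the value is never consumed.
const boundaryFirst = attrs.match(/[\w-]+(?:\b|=".+?")/g);

console.log(valueFirst);    // name="name" and disabled
console.log(boundaryFirst); // name, name, disabled
```

With \b first, the word inside the quotes is even picked up as a separate match, which is why the value option goes first.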
Okie dokie. Here's how we are looking:
/(?<=<.+? )(?<!>)[\w-]+(?:=".+?"|\b)(?!<)(?=.*?>)/gm
Which uses a positive look behind to check for the starting HTML tag ((?<=<.+? )), and a negative look behind ((?<!>)) as a guard against a tag that has just closed (so a class="whatever" placed directly after a > doesn't match).
And a negative and positive look ahead ((?!<)(?=.*?>)) to also check that the rest of the start tag follows. These checks only look at the neighbouring characters and for a later >, so treat this as a good fit for predictable editor output rather than a full HTML parser.
And that works!
const html = '<div>fake="true"<h1 class="title-class"></h1><form><label data-columns="3"><input name="name" id="name" placeholder="Your Name" disabled /></label></form></div>';
const matches = html.match(/(?<=<.+? )(?<!>)[\w-]+(?:=".+?"|\b)(?!<)(?=.*?>)/gm);
console.log(matches);
console.log(`There are ${matches.length} HTML attributes`);
// ['class="title-class"', 'data-columns="3"', 'name="name"', 'id="name"', 'placeholder="Your Name"', 'disabled']
// "There are 6 HTML attributes"
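One caveat worth spelling out: the lookarounds only check the characters right next to each match plus a later >, so attribute-like text that follows a space between tags can still get through. A rough sketch of a stricter two-step alternative (a different technique from the single regex above): match each start tag first, then look for attributes inside that tag only. getAttributes is a hypothetical helper, and it assumes double-quoted values that never contain >.

```javascript
// Hypothetical helper (not the single-regex approach): collect attributes per start tag.
// Assumes double-quoted values that never contain ">".
function getAttributes(html) {
  const attributes = [];
  // Step 1: find start tags such as <input id="name" disabled />.
  for (const [tag] of html.matchAll(/<[a-zA-Z][^>]*>/g)) {
    // Step 2: inside one tag, an attribute is whitespace + name + optional ="value".
    for (const [, name, value] of tag.matchAll(/\s([\w-]+)(?:="([^"]*)")?/g)) {
      attributes.push({ name, value });
    }
  }
  return attributes;
}

// Illustrative input with attribute-like text between the tags.
const sample = '<p class="note">some text="fake" here</p><input id="name" disabled />';

// The single regex accepts text="fake" because a space precedes it and a > follows later.
const singleRegex = sample.match(/(?<=<.+? )(?<!>)[\w-]+(?:=".+?"|\b)(?!<)(?=.*?>)/gm);

// The two-step version only ever looks inside start tags.
const twoStep = getAttributes(sample).map((attribute) => attribute.name);

console.log(singleRegex); // includes text="fake"
console.log(twoStep);     // class, id, disabled
```

For output you control, like my editor's, the single regex is usually enough. The two-step version is for when the text content is less predictable.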
Replacing attributes with regex
Let's go back to my original use case. I want to replace any colspan and rowspan attributes that are the default value ("1").
To start with, I need to tweak the regex to match the space before the attribute. Because otherwise, if I remove an attribute, without removing the space before it, I'll end up with extra spaces in the HTML.
So instead of checking for the space after the tag in the Lookbehind, I'll get it in the match:
/(?<=<.+?)(?<!>) [\w-]+(?:=".+?"|\b)(?!<)(?=.*?>)/gm
Now I can use the replacer function to replace an attribute and not end up with two spaces before the next one:
const html = '<div><h1 class="title"></h1><table><tr><td colspan="1" rowspan="3"></td></tr></table><form><label><input name="name" id="name" placeholder="Your Name" disabled />test="false"</label></form></div>';
const result = html.replace(/(?<=<.+?)(?<!>) [\w-]+(?:=".+?"|\b)(?!<)(?=.*?>)/gm, function(match) {
  if (match.includes('colspan="1"') || match.includes('rowspan="1"')) {
    return '';
  }
  return match;
});
console.log(result);
// '<div><h1 class="title"></h1><table><tr><td rowspan="3"></td></tr></table><form><label><input name="name" id="name" placeholder="Your Name" disabled />test="false"</label></form></div>'
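To make this reusable for other default values, the check can be driven by a lookup object instead of hard-coded includes() calls. A minimal sketch, where removeDefaultAttributes and the defaults map are hypothetical names of my own choosing:

```javascript
// Same pattern as above, with the leading space inside the match.
const ATTRIBUTE = /(?<=<.+?)(?<!>) [\w-]+(?:=".+?"|\b)(?!<)(?=.*?>)/gm;

// Hypothetical helper: `defaults` maps attribute names to the value that should be dropped.
function removeDefaultAttributes(html, defaults) {
  return html.replace(ATTRIBUTE, (match) => {
    const text = match.trim();
    const equals = text.indexOf("=");
    if (equals === -1) return match; // boolean attribute, keep it
    const name = text.slice(0, equals);
    const value = text.slice(equals + 2, -1); // strip =" and the closing "
    return defaults[name] === value ? "" : match;
  });
}

const table = '<table><tr><td colspan="1" rowspan="3"></td><td colspan="2" rowspan="1"></td></tr></table>';
const cleaned = removeDefaultAttributes(table, { colspan: "1", rowspan: "1" });

console.log(cleaned);
// <table><tr><td rowspan="3"></td><td colspan="2"></td></tr></table>
```

Comparing the exact name and value also avoids accidentally matching a longer name that merely contains colspan="1".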
I can check what attributes exist, and even split the attribute name and value - like this:
const html = '<div><h1 class="title"></h1><table><tr><td colspan="1" rowspan="3"></td></tr></table><form><label><input name="name" id="name" placeholder="Your Name" disabled />test="false"</label></form></div>';
const result = html.replace(/(?<=<.+?)(?<!>) [\w-]+(?:=".+?"|\b)(?!<)(?=.*?>)/gm, function(match) {
  let [name, value] = match.trim().split("="); // remove space and split (assumes no "=" inside the value)
  console.log(name, value);
  return match; // return the match unchanged so nothing is replaced with "undefined"
});
// "class" '"title"'
// "colspan" '"1"'
// "rowspan" '"3"'
// "name" '"name"'
// "id" '"name"'
// "placeholder" '"Your Name"'
// "disabled" undefined
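If the goal is only to read the names and values, capture groups with matchAll() avoid both the split("=") step and the need to return something from a replacer. A sketch under the same assumptions as before (double-quoted values); the input string and the ATTRIBUTE_PARTS name are made up for illustration:

```javascript
// Same pattern, but with capture groups around the name and the value.
const ATTRIBUTE_PARTS = /(?<=<.+?)(?<!>) ([\w-]+)(?:="(.+?)"|\b)(?!<)(?=.*?>)/gm;

// Illustrative input: note the "=" inside the href value.
const link = '<a href="/search?q=regex" class="nav" hidden>Search</a>';

const pairs = [];
for (const [, name, value] of link.matchAll(ATTRIBUTE_PARTS)) {
  pairs.push([name, value]); // value is undefined for boolean attributes
}

console.log(pairs);
// href -> /search?q=regex, class -> nav, hidden -> undefined
```

The href value survives intact here, whereas split("=") would cut it at the second = sign.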
If you need to know what tag an attribute belongs to, check out "How do you find all the HTML tags in regex?"