Escape and unescape HTML using Python

This blog post teaches you to escape and unescape HTML strings and files using Python's built-in “html” module. The module provides two functions, escape() and unescape(), which let us perform these operations.

1. Escape

First, let us see how to escape HTML strings and files using the escape() function from the html module.

What is escaping in HTML?

A character escape is a way of representing a character in source code using only ASCII characters. In HTML we can escape ‘<’ as ‘&lt;’ and ‘>’ as ‘&gt;’, so the browser displays them as text instead of reading them as markup.

Escape HTML strings

First, we have to import the escape() function from the html module. It takes a string of HTML code and returns the escaped code as a string.
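The original code listing is not preserved in this text, so here is a minimal sketch of what it may have looked like. The input document is reconstructed from the output shown in this post, and the variable names html_string and escaped are assumptions.

```python
from html import escape

# Sample HTML document, reconstructed from the output shown in this post.
html_string = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p>This is a sample string! </p>
</body>
</html>"""

# escape() replaces &, < and > with entities; by default it also
# replaces double and single quotes.
escaped = escape(html_string)
print(escaped)
```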

The output of the above code is

&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
&lt;meta charset=&quot;UTF-8&quot;&gt;
&lt;title&gt;Title&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;p&gt;This is a sample string! &lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;

As you can see, the characters that have a special meaning in HTML (‘<’, ‘>’ and ‘"’) are converted into their entity equivalents (‘&lt;’, ‘&gt;’ and ‘&quot;’).

We can also get the contents of an HTML file as a string using the read() method and pass it to escape() like this.

Figure: escaping HTML code from a file using Python
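Since the screenshot itself is not available here, the following is a rough sketch of the same idea. The helper name escape_html_file is hypothetical, and the sketch writes the sample document to a temporary sample.html only so that it can run on its own; in practice the file would already exist.

```python
import tempfile
from html import escape
from pathlib import Path

# Same sample document as in the string example above.
SAMPLE_HTML = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p>This is a sample string! </p>
</body>
</html>"""


def escape_html_file(path):
    """Read the file at `path` and return its contents HTML-escaped."""
    with open(path, encoding="utf-8") as html_file:
        return escape(html_file.read())


# Write the sample document to a temporary sample.html so the sketch is
# self-contained, then escape the file's contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.html"
    sample_path.write_text(SAMPLE_HTML, encoding="utf-8")
    escaped_from_file = escape_html_file(sample_path)

print(escaped_from_file)
```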

The output of this code is the same as the output above, because I added the same code to the sample.html file.

2. Unescape

Unescaping is the opposite of escaping. For this purpose, we will use the unescape() function from the html module.

What is unescaping in HTML?

In this mechanism, the escaped entities are converted back to the characters they stand for, which restores valid HTML tags and elements. ‘&lt;’ becomes ‘<’ again and ‘&gt;’ becomes ‘>’.

The unescape() function

The unescape() function takes the escaped HTML string as input and returns the original HTML string as output.

Import the unescape() function from the html module.
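The original listing is missing here as well, so this is a minimal sketch under the assumption that the input is the escaped document from the first section. The variable names escaped_string and original are illustrative.

```python
from html import unescape

# Escaped document, as produced by escape() in the first section.
escaped_string = """&lt;!DOCTYPE html&gt;
&lt;html lang=&quot;en&quot;&gt;
&lt;head&gt;
&lt;meta charset=&quot;UTF-8&quot;&gt;
&lt;title&gt;Title&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;p&gt;This is a sample string! &lt;/p&gt;
&lt;/body&gt;
&lt;/html&gt;"""

# unescape() converts the character references back into the
# characters they represent.
original = unescape(escaped_string)
print(original)
```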

The output of this code is

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
<p>This is a sample string! </p>
</body>
</html>

Advantages of escaping and unescaping

Escaping user-supplied text before displaying it helps prevent Cross-Site Scripting (XSS) attacks. XSS is one of the most common web attacks, because it is easy to create an attack vector if the site is not designed carefully.

For example, let us say you have a web page that asks the user to enter their address, and you want the user to confirm it on the next page. So you take the address the user entered and display it on the next page. If the user enters a valid address, there is no problem. But what if the user enters something like this?

<script>
alert("Welcome");
</script>

Your next page will simply produce an alert box saying Welcome. Now consider another case. You are writing a blogging application, and the user enters the script above in the text box provided. You store it in the database, and whoever visits your blog gets to see that alert box. Worse, if the attacker puts the alert in an infinite loop, whoever visits that blog will not be able to read the content at all.

This is just one of the basic attacks that are possible if you don’t escape the text.

So, normally, the text the user entered is escaped and then stored in the database. For example, the attack vector above (the script tag) looks like this after HTML escaping

&lt;script&gt;
alert(&quot;Welcome&quot;);
&lt;/script&gt;

Now the browser will not treat this as a script element but as plain text, so it will display it as

<script>
alert("Welcome");
</script>

instead of executing it.
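To make this concrete, here is a small sketch of escaping user input before placing it in a page. The function render_comment and the surrounding paragraph tag are hypothetical and only illustrate the idea; a real application would usually rely on its template engine's escaping.

```python
from html import escape


def render_comment(user_text):
    """Return an HTML fragment with the user's text escaped before embedding."""
    return "<p>" + escape(user_text) + "</p>"


# The attack vector from the example above.
attack = '<script>\nalert("Welcome");\n</script>'

page_fragment = render_comment(attack)
print(page_fragment)
```

The fragment no longer contains a script tag, only the escaped text, so the browser shows the script source to the reader rather than running it.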

Conclusion

I hope this article was helpful. If you have any questions, leave them in the comments below. Happy coding!