You must have come across URL shortener such as tinyurl, shorturl, bitly etc. In this post we’ll understand how these tools internally work and how can we design a URL shortener. We’ll be using all the concepts that we discussed in the previous posts.
Requirements:
- Convert long URL into short URL
- URL Redirecting from Short URL to Original URL
- System should be highly available, scalable and fault tolerant.
- On an average 100 million URLs will be generated/day.
- ShortURL can only contain combination of numbers (0-9) and characters (a-z, A-Z).
Back of the envelope estimation
- 100 million URLs generated/day so ~100 million writes/day.
- Write operations per second 100 million / 24 / 3600 -> 1.16k
- Read Operations: Usually ratio of write to read operation is 1 : 10 so reads/second would be 1160 * 10 -> 11,600
- Records created in 10 years : 100 million * 365 * 10 = 365 billion records.
- Storage requirement for approx. 10 years considering average URL length as 100 would be: 365 billion * 100= 36.5 TB
API-Endpoints Required
API stands for Application Programming Interface. In the context of APIs, the word Application refers to any software with a distinct function. Interface can be thought of as a contract of service between two applications. This contract defines how the two communicate with each other using requests and responses.
- POST: api/v1/url/shortener -> This will consist of long URL as the request parameter and return short URL
- GET : api/v1/shortURL -> It will return long URL for HTTP redirection.
User will visit the short URL onto browser and server will visit short URL request and then it will change short URL to the long URL with 301 redirect. 301 HTTP status code is used for temporary redirect.
301 redirect means that the requested URL is “permanently ” moved to the new location and now when you will try to visit the old URL you’ll get 404 not found. Incase of URL shorteners since request is permanently redirected, the browser caches the response and the subsequent requests for the same URL will not be sent to the URL shortening server instead requests will be redirected to the long server directly.
302 redirect means URL is temporarily moved to a new URL location. So, subsequent calls for the same URL will be first sent to the URL shortener and then they will be redirected to the long URL server.
So which redirect method should you use? 301 or 302, incase 301 redirect only first request goes once to the shortening server. In this case analytics will be lost as after first request URL shortener will not know about it if you want to track click rate and source of the click 302 redirect is better.
Using hash tables basic URL shortener can be created, hash table with store <shortURL, longURL> and redirect will happen in two steps : 1. Getting long URL from hash table 2. Performing URL redirect. Before storing shortURL inside a hashtable we need to first generate it. Whenever we receive longURL we need to convert into shortURL.
URL Shortening
To convert a long URL into a short URL and then getting back long URL from short URL to perform redirect we need have a hash function that will map the long URL to the hash function.
Hash function as earlier said will have following requirements:
- Each long URL must be hashed to one hashValue
- Each hashValue can be mapped back to long URL.
Using hash tables for storing <shortURL,longURL> is not scalable and feasible for real-world distributed systems and storing everything in-memory is not possible. Using relational database to store mapping <shortURL,longURL> is a better option. To get started will keep things simple and table will contain 3 columns : id, shortURL, longURL
Selecting suitable hash function
As per our requirement hashValue can only consist of characters from [0-9, a-z, A-Z] so total possible characters are: 62 (10+26+26).
We need to create 365 billion records in 10 years and the smallest length of hashValue to support this would be 62^n >= 365 billion, with n=7 we will have approx. 3.5 trillion number of URLs which is more than 365 billion.
Hashing and Collision Resolution
Implement a hash function that hashed long URL to 7-character long string. Well-known functions such as CRC32, MD5, SHA-1, etc. can be used as a straight forward solution.
Even the shortest hash function CRC32 is more than 7 characters long. So to use it only use first 7 characters of the hash value but this will lead to collisions. So, how can you avoid collisions? You can append a new pre-defined string until no more collisions are discovered.
There are few problems with the above approach: Checking if shortURL exists in DB for every request is an expensive query. Eliminating collisions will simply increase the number of such DB calls. Better option to check if something exists in DB or not is to use Bloom Filter.
Base Hash Function Conversion
Base conversion helps to convert the same number between its different number representation systems. Base 62 conversion will be used as there are 62 possible characters for hashValue. Thus, base 62 is a way of using 62 characters for encoding. Learn more about base 62 here.
Short URL length incase of base62 conversion is not fixed it goes up with ID and it is dependent upon a unique ID generator. Collisions can not happen in this case as unique ID is generated every time. How to generate Unique ID in distributed systems has already been discussed in previous lessons.
URL Redirecting
- User will enter shortURL onto web browser.
- Load balancer will forward the request to the webservers.
- If shortURL is already inside cache return longURL directly.
- If not present in cache fetch longURL from the DB. Incase not present in the DB then URL entered would most probably be invalid and return 404.
- Long URL will be returned to the User.
Hello would you mind letting me know which webhost you’re utilizing?
I’ve loaded your blog in 3 completely different web browsers and
I must say this blog loads a lot faster then most.
Can you recommend a good hosting provider at a honest price?
Thank you, I appreciate it!
I’m using hostinger premium to host my website.
Fantastic website. Lots of helpful information here.
I’m sending it to a few pals ans also sharing in delicious.
And obviously, thank you to your effort!
Thank you!