Pastebin
Requirements
Functional
- User should be able to paste data up to 1 MB in size
- Custom URLs
- Expiration (optionally chosen by the user, or a preset default)
- Support text-based data only
- User can set the visibility of a paste to public or private
- User can delete a paste, but only if they created it
Non-functional
- High availability
- Low latency
- Durability
Estimation
- Writes per day - 1M
- Read/write ratio - 10:1
- Avg paste size - 100 KB
- Max paste size - 1 MB
Traffic
| Desc | value |
|---|---|
| QPS(write) | 1M / 86,400 sec ≈ 12 (~10) |
| QPS(read) | 10*10 = 100 |
| Peak QPS(read) | ~2 x 100 = 200 (assumption) |
Storage
| Desc | value |
|---|---|
| storage per day | 1 MB x 1M = 1 TB (conservative, using the max paste size) |
| storage for 5 years | 1 TB x 365 x 5 ≈ 1.8 PB |
Bandwidth
| Desc | value |
|---|---|
| Ingress | 1 MB/paste x 1M pastes/day ÷ ~10^5 sec/day ≈ 10 MB/sec |
| Egress | 1 MB/paste x 10M reads/day ÷ ~10^5 sec/day ≈ 100 MB/sec |
Memory(Caching)
We can follow the 80/20 rule for caching, where 80% of the traffic is served by the 20% of pastes we cache; the rest is served from the origin. We can use a cache with a TTL of 1 day.
| Desc | value |
|---|---|
| cache size | 1 MB/paste x 10M reads/day x 0.2 = 2 TB |
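The arithmetic above as a minimal back-of-envelope sketch, assuming the 1 MB max paste size and the volumes listed under Estimation:

```python
# Back-of-envelope numbers from the Estimation section:
# 1M writes/day, 10:1 read/write ratio, 1 MB max paste size.
WRITES_PER_DAY = 1_000_000
READ_WRITE_RATIO = 10
MAX_PASTE_BYTES = 1_000_000      # 1 MB (decimal, to match the rounded figures above)
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

write_qps = WRITES_PER_DAY / SECONDS_PER_DAY                 # ~12
read_qps = write_qps * READ_WRITE_RATIO                      # ~116 (≈100)
storage_per_day = WRITES_PER_DAY * MAX_PASTE_BYTES           # ~1 TB
storage_5_years = storage_per_day * 365 * 5                  # ~1.8 PB
ingress_bps = storage_per_day / SECONDS_PER_DAY              # ~12 MB/s
egress_bps = ingress_bps * READ_WRITE_RATIO                  # ~120 MB/s
cache_bytes = WRITES_PER_DAY * READ_WRITE_RATIO * MAX_PASTE_BYTES * 0.2  # ~2 TB

print(f"write QPS ~{write_qps:.0f}, read QPS ~{read_qps:.0f}")
print(f"storage/day ~{storage_per_day / 1e12:.1f} TB, 5 years ~{storage_5_years / 1e15:.1f} PB")
print(f"ingress ~{ingress_bps / 1e6:.0f} MB/s, egress ~{egress_bps / 1e6:.0f} MB/s, cache ~{cache_bytes / 1e12:.1f} TB")
```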
API Design
We can use REST for loose coupling and ease of debugging.
Create paste
Request
/pastes
method: POST
authorization:...
{
name:<string>,
content:<string>,
visibility:<enum>,
custom:<string>, (optional)
expiry:<datetime> (optional)
}
Response
201
{paste-id-url}
401 - unauthorized
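A hedged client-side example of the create call; the base URL, bearer-token scheme, and ISO-8601 timestamp format are assumptions for illustration:

```python
import requests

# Hypothetical base URL and access token, for illustration only.
BASE_URL = "https://api.pastebin.example"
TOKEN = "user-access-token"

resp = requests.post(
    f"{BASE_URL}/pastes",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "name": "hello-world",
        "content": "print('hello')",
        "visibility": "public",            # or "private"
        "custom": "my-first-paste",        # optional custom URL slug
        "expiry": "2025-01-01T00:00:00Z",  # optional expiry (format assumed)
    },
)
if resp.status_code == 201:
    print("paste created:", resp.text)     # body carries the paste-id URL
elif resp.status_code == 401:
    print("unauthorized")
```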
Get paste
Request
/pastes/:paste-id
method: GET
authorization:...
Response
200
{
name:<string>,
content:<string>,
visibility:<enum>,
id:<string>, (optional)
expiry:<datetime> (optional),
s3_link:<string>
}
401 - unauthorized
Delete paste
Request
/pastes/:paste-id
method: DELETE
authorization:...
Database
- Since there is going to be a lot of data, we need to decide which DB to use. I am planning to choose SQL for these reasons:
- Strict schema
- relational data
- need of complex joins
- lookup by index
- Also, paste content can be very large, so we can store the content itself in S3 object storage; it is cost-effective and reduces DB I/O.
- An alternative to S3 for storing large content is MongoDB (GridFS).
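A minimal sketch of the paste-metadata table, assuming content lives in S3 and only metadata (including the S3 link) is kept in SQL; column names and types are illustrative, shown with SQLite for brevity:

```python
import sqlite3

# Illustrative metadata schema; actual column names/types are assumptions.
# Paste content itself lives in object storage (S3); only metadata is kept here.
conn = sqlite3.connect("pastebin.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS pastes (
    paste_id    TEXT PRIMARY KEY,      -- base58-encoded ID
    user_id     TEXT NOT NULL,         -- creator (also the shard key)
    name        TEXT,
    visibility  TEXT NOT NULL CHECK (visibility IN ('public', 'private')),
    s3_link     TEXT NOT NULL,         -- pointer to the content in object storage
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    expiry      TIMESTAMP              -- NULL means never expires
)
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_pastes_user ON pastes (user_id)")
conn.commit()
```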
HLD
Encoding
- We need to encode our pasteId into a readable format; we can use base58.
- base58 is similar to base62, but it drops the easily confused characters O (uppercase o), 0 (zero), I (uppercase i), and l (lowercase L).
- So the base58 alphabet is A-Z, a-z, 0-9 minus those four characters (62 - 4 = 58 symbols).
- Total paste IDs possible with an 8-character base58 ID = 58^8 ≈ 128 trillion
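A minimal base58 encoder sketch for turning a numeric ID (e.g. one issued by KGS) into a pasteId; the fixed 8-character length and padding character are assumptions:

```python
# Base58 alphabet: A-Z, a-z, 0-9 minus O, 0, I, l (58 symbols).
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(num: int, length: int = 8) -> str:
    """Encode a non-negative integer as a fixed-length base58 paste ID."""
    chars = []
    while num > 0:
        num, rem = divmod(num, 58)
        chars.append(BASE58_ALPHABET[rem])
    encoded = "".join(reversed(chars)) or BASE58_ALPHABET[0]
    # Left-pad to a fixed-width ID; 58^8 ≈ 128 trillion possible values.
    return encoded.rjust(length, BASE58_ALPHABET[0])

print(base58_encode(123_456_789))  # prints a padded 8-character base58 ID
```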
Write paste
- A single-machine solution will not scale out, so we move key generation into a separate Key Generation Service (KGS).
- Operations performed when a client creates a paste
- the write call is rate limited
- KGS creates a unique encoded pasteId
- the system requests a presigned URL from the object storage (see the presigned-URL sketch below this list)
- a presigned URL allows the client to upload content directly to storage without needing authentication each time
- The paste URL is created by appending the pasteId.
http://presigned-url/paste-id
- The paste content is transferred directly from the client to the object storage using the paste URL, which saves bandwidth cost and improves performance
- The object storage persists the paste content at the paste URL
- The metadata of the paste, including the paste URL, is persisted in the SQL database
- The server returns the paste ID to the client for future access
- Some other things we can do on the server
- Huffman encoding for reducing text size
- content encryption
- Use a Bloom filter when a custom URL is requested, to check whether that custom URL is already taken
- Random ID Generation
- we can pick one of the following approaches
- Twitter's Snowflake
- MD5 + hashing (the approach we followed in the URL shortener)
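A minimal sketch of the presigned-URL step using boto3 against S3; the bucket name, key layout, and expiry window are assumptions:

```python
import boto3

# Assumed bucket name and expiry; in the real flow the pasteId comes from KGS.
s3 = boto3.client("s3")

def create_upload_url(paste_id: str, expires_in: int = 300) -> str:
    """Return a presigned URL the client can PUT the paste content to directly."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "pastebin-content", "Key": paste_id},
        ExpiresIn=expires_in,  # URL stays valid for 5 minutes
    )

# The API server persists metadata (paste_id, s3 link, expiry, ...) in SQL,
# then returns the paste_id and upload URL to the client.
upload_url = create_upload_url("4fz9BkQ2")
print(upload_url)
```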
Read paste
- Use the cache-aside pattern, with LRU for eviction (see the cache-aside sketch below this list)
- We introduce caching at the following levels
- CDN (public cache) - reduce load on system
- internal cache on data store
- client-side cache (browser)
- API gateway handles
- rate limiting
- Auth
- compression
- filtering
- A Bloom filter is used to avoid cache thrashing: we set the Bloom filter once an element has been accessed more than twice, and a paste is written to the cache only if it is found in the Bloom filter
- We will shard the DB based on user_id
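A minimal cache-aside sketch with LRU eviction; the `fetch_content_from_store` helper is hypothetical, and the Bloom-filter gate is approximated with a plain access counter for brevity:

```python
from collections import OrderedDict

CACHE_CAPACITY = 1000
cache = OrderedDict()   # LRU cache: most recently used entries at the end
access_counts = {}      # stand-in for the Bloom-filter "accessed > twice" gate

def fetch_content_from_store(paste_id):
    """Hypothetical helper: look up the s3_link in SQL, then fetch content from S3."""
    raise NotImplementedError

def get_paste(paste_id):
    # Cache hit: refresh recency and return.
    if paste_id in cache:
        cache.move_to_end(paste_id)
        return cache[paste_id]

    # Cache miss: read from the backing stores.
    content = fetch_content_from_store(paste_id)

    # Only cache "hot" pastes (seen more than twice), mimicking the Bloom-filter gate.
    access_counts[paste_id] = access_counts.get(paste_id, 0) + 1
    if access_counts[paste_id] > 2:
        cache[paste_id] = content
        if len(cache) > CACHE_CAPACITY:
            cache.popitem(last=False)   # evict the least recently used entry
    return content
```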
Deep dive
Scalability
- Scaling a system is an iterative process; we can continuously perform these actions
- benchmark or load test
- profile for bottlenecks and SPOFs
- We can keep the service stateless for horizontal scaling
Rate limiting
- To prevent malicious attacks, we can identify callers by the following
- token for logged-in users
- requests from free and premium clients are throttled differently based on the plan
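A minimal token-bucket sketch for per-caller throttling; the per-plan rates and capacities are assumptions:

```python
import time

class TokenBucket:
    """Simple token-bucket limiter; per-plan rate/capacity are assumptions."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per caller identity (user token); limits differ per plan.
buckets = {
    "free-user-token":    TokenBucket(rate_per_sec=1, capacity=5),
    "premium-user-token": TokenBucket(rate_per_sec=10, capacity=50),
}
```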
Availability
Can be improved by the following
- Run load balancers in active-active or active-passive mode
- Likewise, run KGS in one of the above modes
- Back up the DB at least once a day
Fault Tolerance
- we are already using a microservices architecture, which improves fault tolerance
- we can further implement circuit breakers and backpressure
Analytics
- Set up Kafka for analytics and process events asynchronously (a minimal producer sketch follows)
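A minimal producer sketch, assuming the kafka-python client and a hypothetical `paste-events` topic; the broker address and event fields are illustrative:

```python
import json
from kafka import KafkaProducer  # assumes the kafka-python client

# Topic name and broker address are assumptions for illustration.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_paste_event(event_type: str, paste_id: str, user_id: str) -> None:
    """Fire-and-forget analytics event; consumers process it asynchronously."""
    producer.send("paste-events", {
        "type": event_type,      # e.g. "paste_created", "paste_read"
        "paste_id": paste_id,
        "user_id": user_id,
    })

emit_paste_event("paste_created", "4fz9BkQ2", "user-123")
```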
DB cleanup
We can remove expired data to reduce storage cost by
- lazy removal (delete on read when expired)
- a dedicated cleanup service
- a timer-based (scheduled) cleanup service
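A minimal sketch of lazy removal plus a periodic bulk cleanup, reusing the metadata table sketched earlier and assuming `expiry` is stored as an ISO-8601 UTC string:

```python
from datetime import datetime, timezone

def _now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()

def get_paste_metadata(conn, paste_id):
    """Lazy removal: if the paste has expired, delete it at read time and report a miss."""
    row = conn.execute(
        "SELECT paste_id, s3_link, expiry FROM pastes WHERE paste_id = ?", (paste_id,)
    ).fetchone()
    if row is None:
        return None
    expiry = row[2]
    # Assumes expiry is stored as a consistently formatted ISO-8601 UTC string,
    # so lexicographic comparison matches chronological order.
    if expiry is not None and expiry < _now_iso():
        conn.execute("DELETE FROM pastes WHERE paste_id = ?", (paste_id,))
        conn.commit()
        return None
    return row

def cleanup_expired(conn) -> int:
    """Dedicated/timer cleanup: run periodically (e.g. via cron) to bulk-delete expired rows."""
    cur = conn.execute(
        "DELETE FROM pastes WHERE expiry IS NOT NULL AND expiry < ?", (_now_iso(),)
    )
    conn.commit()
    return cur.rowcount
```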