Emil Fine


System Design – Dating Advice

Preface

  • When designing a large system, a few things need to be considered: the architectural pieces, how they fit together, and how to best use them given their tradeoffs. It’s wise to anticipate and plan for scaling before it is needed, which can save time and resources in the future.
  • The key characteristics of a distributed system are:
    • Scalability: maintaining performance as load grows
    • Reliability: maintaining service if a component goes down (closely tied to availability over time)
    • Availability: the percentage of time the system is up and running
    • Efficiency: latency, bandwidth, network topology / load, etc.
    • Manageability: how quickly the system can be diagnosed and repaired
  • In the sections below, I will address some of those characteristics with an app MVP that I designed and built
  • Note: The MVP GitHub code contains the code for creating a simple user account and uploading an image with text, based on the AWS architecture described below (EC2, S3, RDS, VPC, GitHub, Route53, Load Balancer, and Auto Scaling Group).

Dating Advice – Why do we need it?

  • Professor Albert Mehrabian’s 7-38-55 rule states that the words themselves account for only 7% of personal communication.
  • Sometimes we don’t know how to respond or continue a conversation in a digital app due to the lack of non-verbal cues, and we need advice on how to reply next. The idea is to let users seek advice on their own conversations and share advice on the conversations of others.

Functional Requirements

  • A user can upload a snapshot of a conversation
  • A user can search for advice to conversations
  • A user can add advice to a conversation
  • Note: ‘Advice’ and ‘Suggestion’ are used interchangeably

Use Cases

  • User uploads an image of a conversation 
  • The image is parsed and converted into text
  • User activates conversation search by uploading an image of a conversation or text 
  • System displays uploaded conversations that match
  • User selects conversation from search results
  • System displays conversation and any existing advice 
  • User adds advice to conversation
  • System adds new advice to conversation

High-Level Design

  • The high-level design is split into two parts: Optical Character Recognition (OCR) and integration with a Quora- or Twitter-like engine
    • OCR will take in a user-uploaded image of a conversation and translate it into text (a minimal OCR sketch follows the diagram below)
    • The text will be sent to the engine to be added to storage, and the engine may return a list of matching conversations
    • Given a conversation selection, the server will query the engine to return the conversation’s advice
    • The user may add new advice to the selected conversation
  • Reusing an existing engine that meets our requirements will save time in design, build, and maintenance, although the individual design components still need to be analyzed for a cost comparison
  • If we cannot find a suitable engine, we’ll have to build our own, following the high-level designs and user flows below.
High-Level Design Diagram
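As a rough illustration of the OCR step above, here is a minimal sketch using the open-source pytesseract library. The library choice, function name, and preprocessing are my assumptions; any OCR engine (including a managed service such as AWS Textract) could fill this role.

# Minimal OCR sketch: convert an uploaded conversation screenshot into text.
# Assumes the Tesseract binary plus the pytesseract and Pillow packages are installed.
from PIL import Image
import pytesseract

def image_to_conversation_text(image_path: str) -> str:
    """Return the raw text extracted from a conversation screenshot."""
    with Image.open(image_path) as img:
        # Converting to grayscale usually improves OCR accuracy on chat screenshots.
        text = pytesseract.image_to_string(img.convert("L"))
    return text.strip()

if __name__ == "__main__":
    print(image_to_conversation_text("conversation.png"))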

Use Case Design Flow Diagram

Use Case Flow Diagram

Engine Flow Diagram

Technical Requirements

  • System should take in an image, parse it, and return the text content
  • Should be reliable; consistency can take a hit (eventual consistency is acceptable)
  • Should be Highly Available for resiliency and scalability
  • Search time should not increase significantly as the data set grows
  • Out of Scope
    • News Feed
    • Log In

Capacity Estimations

  • Assumption: for simplicity, the annual user base is constant, but we still need to account for traffic spikes
  • Traffic Estimates
    • 50 million (m) users, 5 m daily users
    • 5 m sessions a day = 58 sessions / second (s)
    • 1 : 4 PUT / GET ratio
      • 1 m daily object uploads / puts
        • ~11.5 / s
        • 1 : 2 ratio for conversations : suggestions
      • 4 m daily object reads / GETs
        • ~46 / s
    • NOTE: For an accurate estimate, we should go through each flow in the diagram to determine the exact number of network/storage PUTs/GETs per session
  • Storage Estimates
    • I will discuss some tradeoffs between storing objects as images vs. text
    • If we do not convert images to text, the storage estimates are:
      • 1 : 2 image / text-suggestion ratio for 1 m PUTs
      • Average photo size = 400 KB, so 1 m * 33% * 400 KB = 132 GB / day
      • Average text size = 640 bytes, so 1 m * 66% * 640 bytes = 0.43 GB / day
      • Daily PUT volume = 132 + 0.43 GB ≈ 133 GB / day
      • Total space for 10 years = 365 days * 10 years * 133 GB ≈ 485 TB
      • To keep some margin, we assume an 80% capacity model, which raises our storage needs to ~607 TB
    • If we do convert images to text, the storage estimates are:
      • 1 : 2 conversation / text-suggestion ratio for 1 m PUTs
      • Average conversation size = 3 KB, so 1 m * 33% * 3 KB = 1 GB / day
      • Average text size = 640 bytes, so 1 m * 66% * 640 bytes = 0.43 GB / day
      • Daily PUT volume = 1 + 0.43 GB = 1.43 GB / day
      • Total space for 10 years = 365 days * 10 years * 1.43 GB ≈ 5.2 TB
      • To keep some margin, we assume an 80% capacity model, which raises our storage needs to ~6.5 TB
    • The storage difference between images and text over 10 years is ~100x; the text option, however, requires a good image-to-text algorithm.
  • Bandwidth Estimates
    • Images
      • PUT = 11.5/s * (33% * 400 KB) + 11.5/s * (66% * 640 B) = 1.5 MB/s
      • GET = 46/s * (33% * 400 KB) + 46/s * (66% * 640 B) = 6 MB/s
    • Text
      • PUT = 11.5/s * (33% * 3 KB) + 11.5/s * (66% * 640 B) = 16.2 KB/s
      • GET = 46/s * (33% * 3 KB) + 46/s * (66% * 640 B) = 65.5 KB/s
    • 6 MB/s / 65.5 KB/s ≈ 92x difference
  • Cache Estimates
    • Following the 80/20 rule, we should have a cache that can hold 20% of our GET traffic, which reduces latency for 80% of requests
    • For images
      • The average GET package is an image with 2 text suggestions, i.e. 3 objects per package
      • 20% * (4 m / 3) * (400 KB + 640 B + 640 B) ≈ 107 GB daily cache
    • For text
      • The average GET package is a conversation with 2 text suggestions
      • 20% * (4 m / 3) * (3 KB + 640 B + 640 B) ≈ 1.1 GB daily cache
    • A Least Recently Used (LRU) policy will discard the least recently used chunk first
      • First In First Out (FIFO), Last In First Out (LIFO), Most Recently Used (MRU), Least Frequently Used (LFU), and Random Replacement (RR) policies are not ideal for our customers’ access pattern
    • AWS has a Content Delivery Network, CloudFront, that hosts a geographically distributed cache to deliver the most requested static content
    • We can employ a write-through cache technique, where data is written to the cache and the corresponding DB at the same time. This has a higher write latency than write-around or write-back techniques, but that is acceptable since our application is read-heavy
  • Metadata Estimates
    • See the Metadata Estimation section below
  • A short script reproducing the arithmetic in this section follows this list
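The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the request counts, ratios, and object sizes are the assumptions stated in the bullets (using 1 KB = 1,000 B, as the estimates above do), and small differences from the bullets are rounding.

# Back-of-the-envelope capacity estimates (mirrors the assumptions above).
SECONDS_PER_DAY = 86_400
DAILY_PUTS = 1_000_000                 # 1 m uploads / day
DAILY_GETS = 4_000_000                 # 1 : 4 PUT / GET ratio
IMAGE_SHARE, TEXT_SHARE = 0.33, 0.66   # 1 : 2 conversations : suggestions
IMAGE_SIZE = 400_000                   # 400 KB average photo
CONVO_TEXT_SIZE = 3_000                # 3 KB average parsed conversation
SUGGESTION_SIZE = 640                  # 640 B average suggestion

def daily_put_bytes(convo_size: int) -> float:
    """Daily ingest if conversations are stored at `convo_size` bytes each."""
    return DAILY_PUTS * (IMAGE_SHARE * convo_size + TEXT_SHARE * SUGGESTION_SIZE)

for label, size in (("images", IMAGE_SIZE), ("text", CONVO_TEXT_SIZE)):
    per_day = daily_put_bytes(size)
    ten_years = per_day * 365 * 10 / 0.8        # 80% capacity model
    put_bw = per_day / SECONDS_PER_DAY          # bytes per second
    print(f"{label}: {per_day / 1e9:.2f} GB/day, "
          f"{ten_years / 1e12:.1f} TB over 10 years, "
          f"{put_bw / 1e6:.2f} MB/s PUT bandwidth")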

Database Design

We need to store data about users, their records (metadata), and the actual data (uploads). I am using a relational MySQL DB since we have an organized set of relationships known ahead of time. The tradeoff is that adding columns in the future requires schema migrations, which can be disruptive.

  • The Author table contains the user ID and their name
  • The Record table holds all record metadata. Data types can differ, so different types of records can refer to one another:
    • Data_id is a foreign key into the Data table, where the object metadata is stored
    • Ref_data_id is a reference to another record’s data, e.g. a text comment refers to an image record, while an image record refers to itself
    • Author_id links back to the Author table’s primary key, identifying who performed the action
    • Timestamp records when the record was created
  • The Data table contains the object metadata
    • Type_id determines whether the object is an image or text
    • Path is the S3 location where the object is stored
  • This was designed with scalability in mind: since we have an index, it is relatively fast to find any data point by author, record, or data ID
  • Note: this structure is uniform regardless of whether the data type is an image or a text comment, to keep flexibility for additional data types in the future. Indexing helps search queries and read-heavy applications, though it adds storage and processing overhead to keep the index updated on writes (a schema sketch appears at the end of this section)
  • Fetching data from the servers should be paginated depending on the client device: smaller pages for mobile, larger pages for web
  • Anticipating potential failover scenarios, I set up an Active-Passive configuration for our object storage. All PUTs go to the Active storage first and are then replicated to the Passive storage, which uses a lower-access, cheaper storage class. By putting a LB in front, we can run health checks on the Active storage; if it fails, traffic is automatically redirected to the Passive storage. Using an Auto Scaling Group, we can also automatically promote the Passive storage to Active and create a new Active-Passive pair.

  • This is in contrast to a Master-Slave design, where read operations are served by the Slaves instead of the Master. That is good for read-heavy applications, for addressing network or disk I/O bottlenecks, and for geographic dispersion. However, it requires application logic to separate read and write operations, and replication is asynchronous, so instances are not guaranteed to be in sync.
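A minimal sketch of the Author / Record / Data layout described above, expressed as MySQL DDL run through mysql-connector-python. The column names, sizes, and connection details are illustrative assumptions, not the repository’s actual migration script.

# Sketch of the Author / Record / Data tables described above (illustrative column sizes).
import mysql.connector  # assumes mysql-connector-python is installed and a DB is reachable

DDL = [
    """CREATE TABLE IF NOT EXISTS Author (
           author_id INT AUTO_INCREMENT PRIMARY KEY,
           name      VARCHAR(100) NOT NULL
       )""",
    """CREATE TABLE IF NOT EXISTS Data (
           data_id INT AUTO_INCREMENT PRIMARY KEY,
           type_id TINYINT NOT NULL,          -- e.g. 1 = image, 2 = text
           path    VARCHAR(255) NOT NULL      -- S3 key of the stored object
       )""",
    """CREATE TABLE IF NOT EXISTS Record (
           record_id   INT AUTO_INCREMENT PRIMARY KEY,
           data_id     INT NOT NULL,
           ref_data_id INT NULL,              -- suggestion -> conversation; conversation -> itself
           author_id   INT NOT NULL,
           created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
           FOREIGN KEY (data_id)     REFERENCES Data(data_id),
           FOREIGN KEY (ref_data_id) REFERENCES Data(data_id),
           FOREIGN KEY (author_id)   REFERENCES Author(author_id),
           INDEX idx_record_author (author_id)
       )""",
]

if __name__ == "__main__":
    # Connection details are placeholders; real credentials come from config.ini (see setup Step 13).
    conn = mysql.connector.connect(host="localhost", user="app", password="***", database="advice")
    cur = conn.cursor()
    for stmt in DDL:
        cur.execute(stmt)
    conn.commit()
    conn.close()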

Data Partitioning

  • Data partitioning breaks large DBs into smaller parts to improve the manageability, performance, availability, and load balancing of an application. Past a certain scale, it is cheaper and more feasible to scale horizontally than vertically.
  • Since we will be storing a lot of data over 10 years, we need to partition our DB
  • Horizontal Partitioning based on Author ID with Images
    • If one DB shard holds 4 TB, we will need 607 TB / 4 TB ≈ 152 shards
    • We can find the shard number with hash(UserID) mod 152 and then store/retrieve the data from that shard
    • Potential issues 
      • If a user becomes popular, there will be many requests to pull their data from the single server that holds it (a hot shard)
      • Maintaining a uniform distribution is hard if one user has many more uploads than others
      • Adding a new shard will break all existing mappings
    • Solution:
      • Repartition
      • Consistent Hashing
        • With consistent hashing, adding or removing a server only remaps ~k/n keys on average (k = # of keys, n = # of servers), instead of nearly all keys as with mod-based hashing. Using virtual replicas also balances load across the servers better (a minimal sketch follows this list)
  • Partitioning based on Record ID
    • If we store a user’s records on separate shards, fetching a range of that user’s records requires querying multiple shards and gets slower as the data grows; however, the load across servers stays uniform
  • Partitioning limitations:
    • Sharding splits the data across servers, so joining tables across shards becomes infeasible. As long as all the data about an object lives in one location and we don’t need to join with an object on another server, we should be fine. Other mitigations include denormalizing tables, adding partitions, and rebalancing existing partitions, though these operations can require downtime.
  • Using the same approach with text data (6.5 TB over 10 years), we won’t need more than 2 shards, so we can shard based on Record ID.
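A minimal consistent-hashing sketch (with virtual nodes) to illustrate the idea referenced above. The shard names and number of virtual replicas are arbitrary, and a production system would use a vetted library rather than this toy ring.

# Minimal consistent-hash ring with virtual nodes (illustrative only).
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        self._ring = []                      # sorted list of (hash, shard) points
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        """Return the shard that owns `key` (first ring point clockwise from the key's hash)."""
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing([f"shard-{i}" for i in range(4)])
print(ring.shard_for("user:12345"))   # adding/removing a shard only remaps ~k/n keys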

Metadata Estimation

  • 10 year storage totals per table
  • Per row, the Author table has 110 bytes
    • Assuming 50 million users, we get ~5.5 GB
  • Per row, Record table has 44 bytes
    • Assuming 1 million uploads a day, we get 44 MB 
    • For 10 years we need 160 GB
  • Per row, Data table has 120 bytes
    • Assuming 1 million uploads a day, we get 120 MB
    • For 10 years we need 440 GB
  • Total table space for 10 years = 5.5 GB + 160 GB + 440 GB ≈ 605 GB
  • This does not take data compression (JPEGs are already compressed, so further compression is not recommended) or replication into consideration

System APIs

  • See www.github.com/efinesbu/webapp for the code (an illustrative endpoint sketch follows this list)
    • createNewId(name), createNewDataRecord, createNewRecord(author_id, data_id, ref_data_id=None)
      • Populate Tables
    • getDataRecord(data_id)
      • Get upload object path in S3
    • getRecord(record_id)
      • Get object metadata
  • Engine API
    • CreateQuestion(api_dev_key, text, author_id)
    • SearchQuestions(api_dev_key, text, daterange = None, sort, max_result_count)
    • AddSuggestion(api_dev_key, text, question_id, author_id)
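A hedged sketch of what one of these operations could look like as a Flask route. The route path, field names, and in-memory store are my assumptions for illustration; the repository’s actual endpoints and signatures may differ.

# Illustrative Flask endpoints for adding and reading advice on a conversation.
from flask import Flask, jsonify, request

app = Flask(__name__)
SUGGESTIONS = {}   # conversation_id -> list of advice strings (stand-in for the Record/Data tables)

@app.route("/conversations/<int:conversation_id>/advice", methods=["POST"])
def add_advice(conversation_id):
    body = request.get_json(force=True)
    advice = body.get("text", "").strip()
    if not advice:
        return jsonify(error="advice text is required"), 400
    SUGGESTIONS.setdefault(conversation_id, []).append(advice)
    return jsonify(conversation_id=conversation_id, advice=advice), 201

@app.route("/conversations/<int:conversation_id>/advice", methods=["GET"])
def get_advice(conversation_id):
    return jsonify(advice=SUGGESTIONS.get(conversation_id, []))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)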

API Metrics

  • AWS CloudWatch can track metrics that help calibrate resource planning and optimization (a boto3 sketch follows this list)
    • Amount and size of Read/Write IOPS
    • VolumeQueueLength
    • Latency
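A sketch of pulling one of these metrics with boto3, assuming AWS credentials are already configured; the region and volume ID are placeholders.

# Pull the average EBS VolumeQueueLength for the last hour via CloudWatch (illustrative).
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="VolumeQueueLength",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder volume ID
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,                 # 5-minute buckets
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])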

Cloud Diagram

VPC Diagram

For my design I have chosen AWS products: a Virtual Private Cloud (VPC) with 2 subnets, plus Simple Storage Service (S3). The private subnet contains the Relational Database Service (RDS) that hosts our metadata, while S3 holds the objects themselves.

Route53

  • Write calls can take more time than Read calls since they always require access to the disk, while Read calls can be served from cache.
  • For this reason, Route53 can route write and read calls to separate writer and reader nodes on different servers. This doubles as a security feature, since certain users can be given limited access
  • This also allows us to scale and optimize each of these operations independently
  • Route53 directs traffic at the DNS level and works across regions, while Load Balancers actually distribute the traffic within a region

Load Balancing & Auto Scaling Groups

  • AWS has 3 types of Load Balancers; we will use the Application LB (ALB), which operates at layer 7 (the request level), meaning it can route traffic based on content and even rewrite it
    • Network LBs operate at layer 4 (the connection level) and are best for millions of requests per second, which we don’t need
    • Classic LB (CLB) is the legacy option and lacks the content-based routing and target-group features we want
  • Performance Metrics via CloudWatch to determine if we need more replication, load balancing or caching
    • Request Count – # of requests and returned responses by LB
    • Error Count 
    • Surge Queue Length – # of requests received by LB but not processed due to lack of resources
    • Latency – time between receipt of request to LB and return of response
  • Our LB will also point to an Auto Scaling Group
    • This will be configured to maintain enough instances to handle the average traffic of 100k daily users (to be updated to 5 m daily users)
    • It will be configured to scale out incrementally when utilization hits 80%. This should give us time to get new instances set up to take on the additional incoming traffic
    • Assuming our servers can handle 25k concurrent connections each and the average concurrent load is 55k, we would always need 3 instances running at the 80% threshold (20k connections each); see the sizing sketch at the end of this section
      • Server 1: 20k connections
      • Server 2: 20k connections
      • Server 3: 15k connections
    • When we hit 20k (80%) connections on the 3rd instance, a new instance will be spun up
    • We could look at historical data to determine if we would need to spin up more than 1 instance for heavy peak loads of more than 15-20k in a short time frame
    • We could also consider distributing the traffic across instances based on how it impacts performance
  • Load Balancers on AWS distribute traffic from clients to our servers based on Listeners, which check for connection requests from clients using the protocol and port that I configure. This lets us send traffic from clients and from engineers to different servers with different entitlement settings, improving security.
  • Health Checks are performed on all servers to decide whether to include or remove them from the LB’s target pool
Load-Balancer Framework
  • If this becomes a mobile app, we may want to keep 2 different code bases; this could be managed by putting a LB in front
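A small sketch of the instance-count arithmetic described above; the per-instance connection limit and scale-out threshold are the assumptions from the bullets (25k connections, 80%).

# How many instances do we need so that no server exceeds 80% of its connection limit?
import math

MAX_CONNECTIONS_PER_INSTANCE = 25_000
SCALE_OUT_THRESHOLD = 0.80          # add capacity once an instance reaches 80% utilization

def instances_needed(concurrent_connections: int) -> int:
    usable_per_instance = MAX_CONNECTIONS_PER_INSTANCE * SCALE_OUT_THRESHOLD   # 20k
    return math.ceil(concurrent_connections / usable_per_instance)

print(instances_needed(55_000))   # -> 3, matching the 20k / 20k / 15k split above
print(instances_needed(61_000))   # -> 4, a new instance is spun up past 60k total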

Reliability, Redundancy, and Cache

  • Losing files is not an option for our service. Creating redundancy in a system can remove single points of failure and provide a backup or spare functionality if needed in a crisis
  • For reliability, we can turn on S3 cross-region replication so that all S3 data is copied to another region, using S3 Infrequent Access there to save costs.
  • Since this service needs to deliver to globally distributed users efficiently, our service should push content closer to the user using Content Delivery Networks (CDNs) such as AWS’s CloudFront. 
  • We can have multiple secondary database servers for each DB partition to serve read traffic only. All writes will go to the primary server and then be replicated to the secondary servers. This gives us fault tolerance: whenever the primary server goes down, we can fail over to a secondary server (a small failover sketch follows)
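A minimal sketch of the read/failover behavior described above, using placeholder endpoints and a stand-in for the real database call; in practice the routing would be handled by the load balancer or the RDS reader endpoint rather than application code.

# Illustrative primary/replica routing: reads prefer replicas, writes go to the primary,
# and reads fall back to the primary if every replica is unreachable.
import random

PRIMARY = "db-primary.internal"                       # placeholder endpoints
REPLICAS = ["db-replica-1.internal", "db-replica-2.internal"]

def execute(endpoint: str, query: str) -> str:
    # Stand-in for a real database call.
    return f"{endpoint} -> {query}"

def read(query: str) -> str:
    for endpoint in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return execute(endpoint, query)
        except ConnectionError:
            continue                                   # try the next replica
    return execute(PRIMARY, query)                     # last resort: read from the primary

def write(query: str) -> str:
    return execute(PRIMARY, query)                     # all writes go to the primary

print(read("SELECT * FROM Record LIMIT 10"))
print(write("INSERT INTO Author (name) VALUES ('Emil')"))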

Security & Infrastructure

  • Route53 can distinguish engineers and users if you set up distinct URLs for access. With this distinction you can have tighter control over access entitlements
  • Load Balancers can change the target IP address so the client never knows the server IPs and implement access controls
  • A separate module should be created to parse the DB password/config file. This keeps the password out of the main code files (see the config-parsing sketch at the end of the setup instructions)
  • A VPC should be set up with two subnets
    • The public subnet will contain our compute (EC2) resources, the web security group, and a NAT Gateway
    • The private subnet will contain the RDS instance and the SQL security group
    • Both will sit behind a Network Access Control List and public/private route tables before the Internet Gateway
  • The VPC’s subnets are placed in Availability Zones, as determined by the Elastic Load Balancer and Auto Scaling Group policies
  • Device type needs to be given consideration for configuring front end functions i.e. iOS vs Windows commands to open a file
  • Software engineers will be able to SSH into the writer node (resolved via Route53) for maintenance

Design Costs

  • ~1.75x difference / month: $1,650 for Images vs. $942 for Text (a short sketch reproducing the S3 storage line items appears at the end of this section)
  • S3 Storage = $142 for Images, $1.54 for Text
    • First 50 TB / m 2.3 cents / GB 
      • IMAGES: 133 GB / day * 30 days * 2.3 cents = $92  / month 
      • TEXT: 1.43 GB / day * 30 days * 2.3 cents = $1  / month 
    • Cross – Region Replication Storage using S3-Infrequent Access @ 1.25 cents / GB
      • IMAGES: 133 GB / day * 30 days * 1.25 cents = $50 / month
      • TEXT: 1.43 GB / day * 30 days * 1.25 cents = $0.54 / month
  • S3 Requests = $200 / m
    • PUT 0.5 cents / 1,000 requests
      • 1 million / day / 1000 requests * 0.5 cents * 30 days = $150 / m
    • GET 0.4 cents / 10,000 requests
      • 4 million / day / 10,000 requests * 0.4 cents * 30 days = $50 / m
  • S3 Data Transfer = $580 for Images, $16 for Text
    • INTO S3 From US EAST Region = 10 TB / m @ 9 cents / GB
      • IMAGES: 133 GB / day * 30 days * 9 cents = $360 / m
      • TEXT: 1.43 GB / day * 30 days * 9 cents = $3.86 / m
    • OUT of S3 INTO US EAST Region = 1 cent / GB
      • IMAGES: 4m * (400 KB * 33%) + 4m * (640 B * 33%) * 2 * 1 GB / 1,000,000 KB * 1 cent / GB * 30 days = $210 / m
      • TEXT: 4m * (3 KB * 33%) + 4m * (640 B * 33%) * 2 * 1 GB / 1,000,000 KB * 1 cent / GB * 30 days = $1.71 / m
    • Lifecycle Transition Requests into S3 Infrequent Access @ 1 cent / 1k requests
      • 1 million GET requests / 1k * 1 cent = $10
  • EC2 Configurations = $481 / m
    • 3 instances for base level
      • 4 vCPUs | 16 GB RAM | Linux | t3a.xlarge EC2
    • On-Demand 15.04 cents / hour, or 9.43 cents / hour with an annual plan
      • Elastic Block Storage
        • General Purpose SSD (gp2) | 60 GB
        • 10 cents / GB * 60 GB / Month = $6 * 3 instances = $18 / m
      • 15 cents * 770 hours / m = $115 * 3 instances = $330 + 18 = $348 / m
    • Approximating 2 additional EC2 instances at half capacity for demand spikes
      • 2 instances * 115 / instance + $6 storage * 2  * 50% uptime = $141 / m
  • RDS Costs = $210 / m
    • Db.t3.xlarge 27.2 cents / hr
      • 27.2 cents * 770 hours / m = $210 / m 
    • Storage Multi-AZ General Purpose SSD @ 23 cents / GB
      • 5 GB / day * 30 days = $150 / m
  • RDS Transfer = $2.7 / m
    • Incoming data is free
    • OUT to internet = 9 cents / GB for 10 TB / m
      • 4m * (44 bytes + 120 bytes + 110 bytes) 9 cent / GB  * 30 days = $2.7 / m
  • Application Load Balancer = $24 / m
    • 2.25 cents / hour
    • 0.8 cents / hour for Load Balancer Capacity Units (bandwidth)
    • (2.25 + 0.8) * 770 hours = $24 / m
  • Route53 = $1 / m
    • 50 cents per hosted zone / m * 2 zones = $1 
  • CloudWatch Metrics = $6 / m
    • First 10k metrics @ 30 cents / metric 
      • 3 instances * 7 metrics * 30 cents = $6 / m
  • CloudFront / CDN (out of scope)
    • Depends on traffic configurations between CDN, Route53, LB, and regions
  • Other
    • Image to Text Algorithm
      • $100,000 over 6 months
    • Vendor (Quora like) API
      • Referring to AWS REST API costs of $3.50 / 300 million requests / M = $3.50
      • + Commercial Contract 
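The recurring S3 storage line items above can be reproduced with a few lines; the rates are the us-east-1 prices quoted in the bullets, and small differences are rounding.

# Reproduce the monthly S3 storage line items from the daily ingest estimates.
STANDARD_RATE = 0.023      # $/GB-month, first 50 TB
IA_RATE = 0.0125           # $/GB-month, S3 Infrequent Access (cross-region replica)

def monthly_storage_cost(gb_per_day: float):
    gb_per_month = gb_per_day * 30
    return gb_per_month * STANDARD_RATE, gb_per_month * IA_RATE

for label, gb_per_day in (("images", 133), ("text", 1.43)):
    std, ia = monthly_storage_cost(gb_per_day)
    print(f"{label}: ${std:.2f} standard + ${ia:.2f} IA replica = ${std + ia:.2f} / month")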

AWS Prototype Set-up Instructions

Step 1) Create S3 bucket: ny-s3fs-bucket

Step 2) Create Role: S3FS-Role with the following policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::ny-s3fs-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::ny-s3fs-bucket/*"
            ]
        }
    ]
}

Step 3) Create User access Key Pair & download, or use existing

Step 4) Create EC2 free tier instance: webserver2

  • Create or use a security group with 
    • HTTP listening to all 0.0.0.0/0
    • SSH listening to MyIP
  • Set Role to S3FS-Role
  • Note: Step 6 installations can be automated here

Step 5) Access EC2 via SSH

  • Generate public key & use as identity

ssh-keygen -y -f MyKP.pem > MyKPub

ren MyKP.pem MyKP

(Note: ren is the Windows rename command; on Linux/macOS use mv MyKP.pem MyKP)

  • Set user: ec2-user
  • Set host / EC2 Public IPv4: 18.205.29.255
  • Take note of private IP once logged in
    • [ec2-user@ip-172-31-87-215 ~]$

Step 6) Run the following commands

Note: Step 6 can be automated in Step 4 if you expect to do this multiple times (Auto Scaling)

#!/bin/bash
sudo yum update -y
sudo amazon-linux-extras install epel -y
sudo yum install s3fs-fuse -y
sudo yum install python3 -y
sudo yum install git -y
# Python packages (either the --user installs or the sudo installs are needed, not both)
pip3 install flask --user
sudo pip3 install flask
pip3 install mysql-connector-python --user
sudo pip3 install mysql-connector-python
sudo pip3 install ec2-metadata

Step 7) Set up credentials for s3fs

echo <access key>:<secret key> > ${HOME}/.passwd-s3fs

echo AKIAYDUNYYT75SNXNE:cR+D40HVgkDxJYP8f1JZogG7MUV7t6Ln5idFn > ${HOME}/.passwd-s3fs

chmod 600 ${HOME}/.passwd-s3fs

Step 8) Set up S3FS mount

mkdir mountstorage

s3fs ny-s3fs-bucket /home/ec2-user/mountstorage -o passwd_file=${HOME}/.passwd-s3fs

Step 9) Create virtual environment if needed (optional but recommended)

  • Skip

Step 10) Create Github directory

mkdir github
cd github

Step 11) Pull code

Step 12) Run code

sudo python3 mvpservice.py

Step 13) Set Up DB

  • Add a config.ini file
  • Add a python_mysql_dbconfig file to parse config.ini and import it into the main service (sketched below)
  • Update the DB security group to include the internal/private EC2 IP
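A sketch of the config.ini / python_mysql_dbconfig pattern referenced in Step 13 and in the Security section; the section and key names here are assumptions, not necessarily the ones used in the repository.

# config.ini (kept out of version control, readable only by the app user)
# [mysql]
# host     = <rds endpoint>
# database = advice
# user     = app
# password = <db password>

# python_mysql_dbconfig.py -- parse config.ini so credentials stay out of the main code
from configparser import ConfigParser

def read_db_config(filename: str = "config.ini", section: str = "mysql") -> dict:
    parser = ConfigParser()
    parser.read(filename)
    if not parser.has_section(section):
        raise Exception(f"Section [{section}] not found in {filename}")
    return dict(parser.items(section))

# In mvpservice.py:
# import mysql.connector
# conn = mysql.connector.connect(**read_db_config())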