Emil Fine


System Design – Dating Advice

Preface

  • When designing a large system, a few things need to be considered: the architectural pieces, how they fit together, and how to best use them given their tradeoffs. It’s wise to anticipate and plan for scaling before it is needed, which can save time and resources in the future.
  • The key characteristics of a distributed system are:
    • Scalability: maintaining performance as load grows
    • Reliability: maintaining service if a component goes down (closely tied to availability over time)
    • Availability: the percentage of time the system is up and running
    • Efficiency: latency, bandwidth, network topology / load, etc.
    • Manageability: how quickly the system can be diagnosed and repaired
  • In the sections below, I will address some of those characteristics with an app MVP that I designed and built
  • Note: The MVP GitHub code contains the code for creating a simple user account and uploading an image with text, based on the AWS architecture described below (EC2, S3, RDS, VPC, GitHub, Route53, Load Balancer, and Auto Scaling Group).

Dating Advice – Why do we need it?

  • Professor Albert Mehrabian’s 7-38-55 rule states that the words themselves account for only 7% of personal communication.
  • Sometimes we don’t know how to respond or continue a conversation in a digital app due to the lack of non-verbal cues, and we need advice on how to reply next. The idea is to let users seek advice on their own conversations and share advice on the conversations of others.

Functional Requirements

  • A user can upload a snapshot of a conversation
  • A user can search for advice to conversations
  • A user can add advice to a conversation
  • Note: ‘Advice’ and ‘Suggestion’ are used interchangeably

Use Cases

  • User uploads an image of a conversation 
  • The image is parsed and converted into text
  • User activates conversation search by uploading an image of a conversation or text 
  • System displays uploaded conversations that match
  • User selects conversation from search results
  • System displays conversation and any existing advice 
  • User adds advice to conversation
  • System adds new advice to conversation

High-Level Design

  • The high-level design is split into two parts: Optical Character Recognition (OCR) and integration with a Quora- or Twitter-like engine
    • OCR will take in a user-uploaded image of a conversation and translate it into text (a minimal OCR sketch follows the diagram below)
    • The text will be sent to the engine to be added to storage, and the engine may return a list of matching conversations
    • Given a conversation selection, the server will query the engine to return the conversation’s advice
    • The user may add new advice to the selected conversation
  • Reusing an existing engine that meets our requirements will save time in design, build, and maintenance, although the individual design components still need to be analyzed for a cost comparison
  • If we cannot find a suitable engine, we’ll have to build our own, following the high-level designs and user flows below.
High-Level Design Diagram
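As a rough illustration of the OCR step above, here is a minimal sketch using the open-source pytesseract library. The library choice, function name, and preprocessing are my assumptions; any OCR engine (including a managed service such as AWS Textract) could fill this role.

# Minimal OCR sketch: convert an uploaded conversation screenshot into text.
# Assumes the Tesseract binary plus the pytesseract and Pillow packages are installed.
from PIL import Image
import pytesseract

def image_to_conversation_text(image_path: str) -> str:
    """Return the raw text extracted from a conversation screenshot."""
    with Image.open(image_path) as img:
        # Converting to grayscale usually improves OCR accuracy on chat screenshots.
        text = pytesseract.image_to_string(img.convert("L"))
    return text.strip()

if __name__ == "__main__":
    print(image_to_conversation_text("conversation.png"))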

Use Case Design Flow Diagram

Use Case Flow Diagram

Engine Flow Diagram

Technical Requirements

  • System should take in an image, parse it, and return the text content
  • Should be reliable; consistency can take a hit (eventual consistency is acceptable)
  • Should be Highly Available for resiliency and scalability
  • Search time should not increase significantly as the data set grows
  • Out of Scope
    • News Feed
    • Log In

Capacity Estimations

  • Assumption: for simplicity, the annual user base is constant, but we still need to account for traffic spikes
  • Traffic Estimates
    • 50 million (m) users, 5 m daily users
    • 5 m sessions a day = 58 sessions / second (s)
    • 1 : 4 PUT / GET ratio
      • 1 m daily object uploads / puts
        • ~11.5 / s
        • 1 : 2 ratio for conversations : suggestions
      • 4 m daily object reads / GETs
        • ~46 / s
    • NOTE: For an accurate estimate, we should go through each flow in the diagram to determine the exact number of network/storage PUTs/GETs per session
  • Storage Estimates
    • I will discuss some tradeoffs between storing objects as images vs. text
    • If we do not convert images to text, the storage estimates are:
      • 1 : 2 image / text-suggestion ratio for 1 m PUTs
      • Average photo size = 400 KB, so 1 m * 33% * 400 KB = 132 GB / day
      • Average text size = 640 bytes, so 1 m * 66% * 640 bytes = 0.43 GB / day
      • Daily PUT volume = 132 + 0.43 GB ≈ 133 GB / day
      • Total space for 10 years = 365 days * 10 years * 133 GB ≈ 485 TB
      • To keep some margin, we assume an 80% capacity model, which raises our storage needs to ~607 TB
    • If we do convert images to text, the storage estimates are:
      • 1 : 2 conversation / text-suggestion ratio for 1 m PUTs
      • Average conversation size = 3 KB, so 1 m * 33% * 3 KB = 1 GB / day
      • Average text size = 640 bytes, so 1 m * 66% * 640 bytes = 0.43 GB / day
      • Daily PUT volume = 1 + 0.43 GB = 1.43 GB / day
      • Total space for 10 years = 365 days * 10 years * 1.43 GB ≈ 5.2 TB
      • To keep some margin, we assume an 80% capacity model, which raises our storage needs to ~6.5 TB
    • The storage difference between images and text over 10 years is ~100x; the text option, however, requires a good image-to-text algorithm.
  • Bandwidth Estimates
    • Images
      • PUT = 11.5/s * (33% * 400 KB) + 11.5/s * (66% * 640 B) = 1.5 MB/s
      • GET = 46/s * (33% * 400 KB) + 46/s * (66% * 640 B) = 6 MB/s
    • Text
      • PUT = 11.5/s * (33% * 3 KB) + 11.5/s * (66% * 640 B) = 16.2 KB/s
      • GET = 46/s * (33% * 3 KB) + 46/s * (66% * 640 B) = 65.5 KB/s
    • 6 MB/s / 65.5 KB/s ≈ 92x difference
  • Cache Estimates
    • Following the 80/20 rule, we should have a cache that can hold 20% of our GET traffic, which reduces latency for 80% of requests
    • For images
      • The average GET package is an image with 2 text suggestions, i.e. 3 objects per package
      • 20% * (4 m / 3) * (400 KB + 640 B + 640 B) ≈ 107 GB daily cache
    • For text
      • The average GET package is a conversation with 2 text suggestions
      • 20% * (4 m / 3) * (3 KB + 640 B + 640 B) ≈ 1.1 GB daily cache
    • A Least Recently Used (LRU) policy will discard the least recently used chunk first
      • First In First Out (FIFO), Last In First Out (LIFO), Most Recently Used (MRU), Least Frequently Used (LFU), and Random Replacement (RR) policies are not ideal for our customers’ access pattern
    • AWS has a Content Delivery Network, CloudFront, that hosts a geographically distributed cache to deliver the most requested static content
    • We can employ a write-through cache technique, where data is written to the cache and the corresponding DB at the same time. This has a higher write latency than write-around or write-back techniques, but that is acceptable since our application is read-heavy
  • Metadata Estimates
    • See the Metadata Estimation section below
  • A short script reproducing the arithmetic in this section follows this list
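The back-of-the-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the request counts, ratios, and object sizes are the assumptions stated in the bullets (using 1 KB = 1,000 B, as the estimates above do), and small differences from the bullets are rounding.

# Back-of-the-envelope capacity estimates (mirrors the assumptions above).
SECONDS_PER_DAY = 86_400
DAILY_PUTS = 1_000_000                 # 1 m uploads / day
DAILY_GETS = 4_000_000                 # 1 : 4 PUT / GET ratio
IMAGE_SHARE, TEXT_SHARE = 0.33, 0.66   # 1 : 2 conversations : suggestions
IMAGE_SIZE = 400_000                   # 400 KB average photo
CONVO_TEXT_SIZE = 3_000                # 3 KB average parsed conversation
SUGGESTION_SIZE = 640                  # 640 B average suggestion

def daily_put_bytes(convo_size: int) -> float:
    """Daily ingest if conversations are stored at `convo_size` bytes each."""
    return DAILY_PUTS * (IMAGE_SHARE * convo_size + TEXT_SHARE * SUGGESTION_SIZE)

for label, size in (("images", IMAGE_SIZE), ("text", CONVO_TEXT_SIZE)):
    per_day = daily_put_bytes(size)
    ten_years = per_day * 365 * 10 / 0.8        # 80% capacity model
    put_bw = per_day / SECONDS_PER_DAY          # bytes per second
    print(f"{label}: {per_day / 1e9:.2f} GB/day, "
          f"{ten_years / 1e12:.1f} TB over 10 years, "
          f"{put_bw / 1e6:.2f} MB/s PUT bandwidth")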

Database Design

We need to store data about users, their records (metadata), and the actual data (uploads). I am using a relational MySQL DB since we have an organized set of relationships known ahead of time. The tradeoff is that adding columns in the future requires schema migrations, which can be disruptive.

  • The Author table contains the user ID and their name
  • The Record table holds all record metadata. Data types can differ, so different types of records can refer to one another:
    • Data_id is a foreign key into the Data table, where the object metadata is stored
    • Ref_data_id is a reference to another record’s data, e.g. a text comment refers to an image record, while an image record refers to itself
    • Author_id links back to the Author table’s primary key, identifying who performed the action
    • Timestamp records when the record was created
  • The Data table contains the object metadata
    • Type_id determines whether the object is an image or text
    • Path is the S3 location where the object is stored
  • This was designed with scalability in mind: since we have an index, it is relatively fast to find any data point by author, record, or data ID
  • Note: this structure is uniform regardless of whether the data type is an image or a text comment, to keep flexibility for additional data types in the future. Indexing helps search queries and read-heavy applications, though it adds storage and processing overhead to keep the index updated on writes (a schema sketch appears at the end of this section)
  • Fetching data from the servers should be paginated depending on the client device: smaller pages for mobile, larger pages for web
  • Anticipating potential failover scenarios, I set up an Active-Passive configuration for our object storage. All PUTs go to the Active storage first and are then replicated to the Passive storage, which uses a lower-access, cheaper storage class. By putting a LB in front, we can run health checks on the Active storage; if it fails, traffic is automatically redirected to the Passive storage. Using an Auto Scaling Group, we can also automatically promote the Passive storage to Active and create a new Active-Passive pair.

  • This is in contrast to a Master-Slave design, where read operations are served by the Slaves instead of the Master. That is good for read-heavy applications, for addressing network or disk I/O bottlenecks, and for geographic dispersion. However, it requires application logic to separate read and write operations, and replication is asynchronous, so instances are not guaranteed to be in sync.
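A minimal sketch of the Author / Record / Data layout described above, expressed as MySQL DDL run through mysql-connector-python. The column names, sizes, and connection details are illustrative assumptions, not the repository’s actual migration script.

# Sketch of the Author / Record / Data tables described above (illustrative column sizes).
import mysql.connector  # assumes mysql-connector-python is installed and a DB is reachable

DDL = [
    """CREATE TABLE IF NOT EXISTS Author (
           author_id INT AUTO_INCREMENT PRIMARY KEY,
           name      VARCHAR(100) NOT NULL
       )""",
    """CREATE TABLE IF NOT EXISTS Data (
           data_id INT AUTO_INCREMENT PRIMARY KEY,
           type_id TINYINT NOT NULL,          -- e.g. 1 = image, 2 = text
           path    VARCHAR(255) NOT NULL      -- S3 key of the stored object
       )""",
    """CREATE TABLE IF NOT EXISTS Record (
           record_id   INT AUTO_INCREMENT PRIMARY KEY,
           data_id     INT NOT NULL,
           ref_data_id INT NULL,              -- suggestion -> conversation; conversation -> itself
           author_id   INT NOT NULL,
           created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
           FOREIGN KEY (data_id)     REFERENCES Data(data_id),
           FOREIGN KEY (ref_data_id) REFERENCES Data(data_id),
           FOREIGN KEY (author_id)   REFERENCES Author(author_id),
           INDEX idx_record_author (author_id)
       )""",
]

if __name__ == "__main__":
    # Connection details are placeholders; real credentials come from config.ini (see setup Step 13).
    conn = mysql.connector.connect(host="localhost", user="app", password="***", database="advice")
    cur = conn.cursor()
    for stmt in DDL:
        cur.execute(stmt)
    conn.commit()
    conn.close()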

Data Partitioning

  • Data partitioning breaks large DBs into smaller parts to improve the manageability, performance, availability, and load balancing of an application. Past a certain scale, it is cheaper and more feasible to scale horizontally than vertically.
  • Since we will be storing a lot of data over 10 years, we need to partition our DB
  • Horizontal Partitioning based on Author ID with Images
    • If one DB shard holds 4 TB, we will need 607 TB / 4 TB ≈ 152 shards
    • We can find the shard number with hash(UserID) mod 152 and then store/retrieve the data from that shard
    • Potential issues 
      • If a user becomes popular, there will be many requests to pull their data from the single server that holds it (a hot shard)
      • Maintaining a uniform distribution is hard if one user has many more uploads than others
      • Adding a new shard will break all existing mappings
    • Solution:
      • Repartition
      • Consistent Hashing
        • With consistent hashing, adding or removing a server only remaps ~k/n keys on average (k = # of keys, n = # of servers), instead of nearly all keys as with mod-based hashing. Using virtual replicas also balances load across the servers better (a minimal sketch follows this list)
  • Partitioning based on Record ID
    • If we store a user’s records on separate shards, fetching a range of that user’s records requires querying multiple shards and gets slower as the data grows; however, the load across servers stays uniform
  • Partitioning limitations:
    • Sharding splits the data across servers, so joining tables across shards becomes infeasible. As long as all the data about an object lives in one location and we don’t need to join with an object on another server, we should be fine. Other mitigations include denormalizing tables, adding partitions, and rebalancing existing partitions, though these operations can require downtime.
  • Using the same approach with text data (6.5 TB over 10 years), we won’t need more than 2 shards, so we can shard based on Record ID.
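A minimal consistent-hashing sketch (with virtual nodes) to illustrate the idea referenced above. The shard names and number of virtual replicas are arbitrary, and a production system would use a vetted library rather than this toy ring.

# Minimal consistent-hash ring with virtual nodes (illustrative only).
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        self._ring = []                      # sorted list of (hash, shard) points
        for shard in shards:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{i}"), shard))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        """Return the shard that owns `key` (first ring point clockwise from the key's hash)."""
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing([f"shard-{i}" for i in range(4)])
print(ring.shard_for("user:12345"))   # adding/removing a shard only remaps ~k/n keys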

Metadata Estimation

  • 10 year storage totals per table
  • Per row, the Author table has 110 bytes
    • Assuming 50 million users, we get ~5.5 GB
  • Per row, Record table has 44 bytes
    • Assuming 1 million uploads a day, we get 44 MB 
    • For 10 years we need 160 GB
  • Per row, Data table has 120 bytes
    • Assuming 1 million uploads a day, we get 120 MB
    • For 10 years we need 440 GB
  • Total table space for 10 years = 5.5 GB + 160 GB + 440 GB ≈ 605 GB
  • This does not take data compression (JPEGs are already compressed, so further compression is not recommended) or replication into consideration

System APIs

  • See www.github.com/efinesbu/webapp for the code (an illustrative endpoint sketch follows this list)
    • createNewId(name), createNewDataRecord, createNewRecord(author_id, data_id, ref_data_id=None)
      • Populate Tables
    • getDataRecord(data_id)
      • Get upload object path in S3
    • getRecord(record_id)
      • Get object metadata
  • Engine API
    • CreateQuestion(api_dev_key, text, author_id)
    • SearchQuestions(api_dev_key, text, daterange = None, sort, max_result_count)
    • AddSuggestion(api_dev_key, text, question_id, author_id)
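A hedged sketch of what one of these operations could look like as a Flask route. The route path, field names, and in-memory store are my assumptions for illustration; the repository’s actual endpoints and signatures may differ.

# Illustrative Flask endpoints for adding and reading advice on a conversation.
from flask import Flask, jsonify, request

app = Flask(__name__)
SUGGESTIONS = {}   # conversation_id -> list of advice strings (stand-in for the Record/Data tables)

@app.route("/conversations/<int:conversation_id>/advice", methods=["POST"])
def add_advice(conversation_id):
    body = request.get_json(force=True)
    advice = body.get("text", "").strip()
    if not advice:
        return jsonify(error="advice text is required"), 400
    SUGGESTIONS.setdefault(conversation_id, []).append(advice)
    return jsonify(conversation_id=conversation_id, advice=advice), 201

@app.route("/conversations/<int:conversation_id>/advice", methods=["GET"])
def get_advice(conversation_id):
    return jsonify(advice=SUGGESTIONS.get(conversation_id, []))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)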

API Metrics

  • AWS CloudWatch can track metrics that help calibrate resource planning and optimization (a boto3 sketch follows this list)
    • Amount and size of Read/Write IOPS
    • VolumeQueueLength
    • Latency
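A sketch of pulling one of these metrics with boto3, assuming AWS credentials are already configured; the region and volume ID are placeholders.

# Pull the average EBS VolumeQueueLength for the last hour via CloudWatch (illustrative).
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="VolumeQueueLength",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder volume ID
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,                 # 5-minute buckets
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])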

Cloud Diagram

VPC Diagram

For my design I have chosen AWS products: a Virtual Private Cloud (VPC) with 2 subnets, plus Simple Storage Service (S3). The private subnet contains the Relational Database Service (RDS) that hosts our metadata, while S3 holds the objects themselves.

Route53

  • Write calls can take more time than Read calls since they always require access to the disk, while Read calls can be served from cache.
  • For this reason, Route53 can route write and read calls to separate writer and reader nodes on different servers. This doubles as a security feature, since certain users can be given limited access
  • This also allows us to scale and optimize each of these operations independently
  • Route53 directs traffic at the DNS level and works across regions, while Load Balancers actually distribute the traffic within a region

Load Balancing & Auto Scaling Groups

  • AWS has 3 types of Load Balancers; we will use the Application LB (ALB), which operates at layer 7 (the request level), meaning it can route traffic based on content and even rewrite it
    • Network LBs operate at layer 4 (the connection level) and are best for millions of requests per second, which we don’t need
    • Classic LB (CLB) is the legacy option and lacks the content-based routing and target-group features we want
  • Performance Metrics via CloudWatch to determine if we need more replication, load balancing or caching
    • Request Count – # of requests and returned responses by LB
    • Error Count 
    • Surge Queue Length – # of requests received by LB but not processed due to lack of resources
    • Latency – time between receipt of request to LB and return of response
  • Our LB will also point to an Auto Scaling Group
    • This will be configured to maintain enough instances to handle the average traffic of 100k daily users (to be updated to 5 m daily users)
    • It will be configured to scale out incrementally when utilization hits 80%. This should give us time to get new instances set up to take on the additional incoming traffic
    • Assuming our servers can handle 25k concurrent connections each and the average concurrent load is 55k, we would always need 3 instances running at the 80% threshold (20k connections each); see the sizing sketch at the end of this section
      • Server 1: 20k connections
      • Server 2: 20k connections
      • Server 3: 15k connections
    • When we hit 20k (80%) connections on the 3rd instance, a new instance will be spun up
    • We could look at historical data to determine if we would need to spin up more than 1 instance for heavy peak loads of more than 15-20k in a short time frame
    • We could also consider distributing the traffic across instances based on how it impacts performance
  • Load Balancers on AWS distribute traffic from clients to our servers based on Listeners, which check for connection requests from clients using the protocol and port that I configure. This lets us send traffic from clients and from engineers to different servers with different entitlement settings, improving security.
  • Health Checks are performed on all servers to decide whether to include or remove them from the LB’s target pool
Load-Balancer Framework
  • If this becomes a mobile app, we may want to keep 2 different code bases; this could be managed by putting a LB in front
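A small sketch of the instance-count arithmetic described above; the per-instance connection limit and scale-out threshold are the assumptions from the bullets (25k connections, 80%).

# How many instances do we need so that no server exceeds 80% of its connection limit?
import math

MAX_CONNECTIONS_PER_INSTANCE = 25_000
SCALE_OUT_THRESHOLD = 0.80          # add capacity once an instance reaches 80% utilization

def instances_needed(concurrent_connections: int) -> int:
    usable_per_instance = MAX_CONNECTIONS_PER_INSTANCE * SCALE_OUT_THRESHOLD   # 20k
    return math.ceil(concurrent_connections / usable_per_instance)

print(instances_needed(55_000))   # -> 3, matching the 20k / 20k / 15k split above
print(instances_needed(61_000))   # -> 4, a new instance is spun up past 60k total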

Reliability, Redundancy, and Cache

  • Losing files is not an option for our service. Creating redundancy in a system can remove single points of failure and provide a backup or spare functionality if needed in a crisis
  • For reliability, we can turn on S3 cross-region replication so that all S3 data is copied to another region, using S3 Infrequent Access there to save costs.
  • Since this service needs to deliver to globally distributed users efficiently, our service should push content closer to the user using Content Delivery Networks (CDNs) such as AWS’s CloudFront. 
  • We can have multiple secondary database servers for each DB partition to serve read traffic only. All writes will go to the primary server and then be replicated to the secondary servers. This gives us fault tolerance: whenever the primary server goes down, we can fail over to a secondary server (a small failover sketch follows)
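A minimal sketch of the read/failover behavior described above, using placeholder endpoints and a stand-in for the real database call; in practice the routing would be handled by the load balancer or the RDS reader endpoint rather than application code.

# Illustrative primary/replica routing: reads prefer replicas, writes go to the primary,
# and reads fall back to the primary if every replica is unreachable.
import random

PRIMARY = "db-primary.internal"                       # placeholder endpoints
REPLICAS = ["db-replica-1.internal", "db-replica-2.internal"]

def execute(endpoint: str, query: str) -> str:
    # Stand-in for a real database call.
    return f"{endpoint} -> {query}"

def read(query: str) -> str:
    for endpoint in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return execute(endpoint, query)
        except ConnectionError:
            continue                                   # try the next replica
    return execute(PRIMARY, query)                     # last resort: read from the primary

def write(query: str) -> str:
    return execute(PRIMARY, query)                     # all writes go to the primary

print(read("SELECT * FROM Record LIMIT 10"))
print(write("INSERT INTO Author (name) VALUES ('Emil')"))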

Security & Infrastructure

  • Route53 can distinguish engineers and users if you set up distinct URLs for access. With this distinction you can have tighter control over access entitlements
  • Load Balancers can change the target IP address so the client never knows the server IPs and implement access controls
  • A separate module should be created to parse the DB password/config file. This keeps the password out of the main code files (see the config-parsing sketch at the end of the setup instructions)
  • A VPC should be set up with two subnets
    • The public subnet will contain our compute (EC2) resources, the web security group, and a NAT Gateway
    • The private subnet will contain the RDS instance and the SQL security group
    • Both will sit behind a Network Access Control List and public/private route tables before the Internet Gateway
  • The VPC’s subnets are placed in Availability Zones, as determined by the Elastic Load Balancer and Auto Scaling Group policies
  • Device type needs to be given consideration for configuring front end functions i.e. iOS vs Windows commands to open a file
  • Software engineers will be able to SSH into the writer node (resolved via Route53) for maintenance

Design Costs

  • ~1.75x difference / month: $1,650 for Images vs. $942 for Text (a short sketch reproducing the S3 storage line items appears at the end of this section)
  • S3 Storage = $142 for Images, $1.54 for Text
    • First 50 TB / m 2.3 cents / GB 
      • IMAGES: 133 GB / day * 30 days * 2.3 cents = $92  / month 
      • TEXT: 1.43 GB / day * 30 days * 2.3 cents = $1  / month 
    • Cross – Region Replication Storage using S3-Infrequent Access @ 1.25 cents / GB
      • IMAGES: 133 GB / day * 30 days * 1.25 cents = $50 / month
      • TEXT: 1.43 GB / day * 30 days * 1.25 cents = $0.54 / month
  • S3 Requests = $200 / m
    • PUT 0.5 cents / 1,000 requests
      • 1 million / day / 1000 requests * 0.5 cents * 30 days = $150 / m
    • GET 0.4 cents / 10,000 requests
      • 4 million / day / 10,000 requests * 0.4 cents * 30 days = $50 / m
  • S3 Data Transfer = $580 for Images, $16 for Text
    • INTO S3 From US EAST Region = 10 TB / m @ 9 cents / GB
      • IMAGES: 133 GB / day * 30 days * 9 cents = $360 / m
      • TEXT: 1.43 GB / day * 30 days * 9 cents = $3.86 / m
    • OUT of S3 INTO US EAST Region = 1 cent / GB
      • IMAGES: 4m * (400 KB * 33%) + 4m * (640 B * 33%) * 2 * 1 GB / 1,000,000 KB * 1 cent / GB * 30 days = $210 / m
      • TEXT: 4m * (3 KB * 33%) + 4m * (640 B * 33%) * 2 * 1 GB / 1,000,000 KB * 1 cent / GB * 30 days = $1.71 / m
    • Lifecycle Transition Requests into S3 Infrequent Access @ 1 cent / 1k requests
      • 1 million GET requests / 1k * 1 cent = $10
  • EC2 Configurations = $481 / m
    • 3 instances for base level
      • 4 vCPUs | 16 GB RAM | Linux | t3a.xlarge EC2
    • On-Demand 15.04 cents / hour, or 9.43 cents / hour with an annual plan
      • Elastic Block Storage
        • General Purpose SSD (gp2) | 60 GB
        • 10 cents / GB * 60 GB / Month = $6 * 3 instances = $18 / m
      • 15 cents * 770 hours / m = $115 * 3 instances = $330 + 18 = $348 / m
    • Approximating 2 additional EC2 instances at half capacity for demand spikes
      • 2 instances * 115 / instance + $6 storage * 2  * 50% uptime = $141 / m
  • RDS Costs = $210 / m
    • Db.t3.xlarge 27.2 cents / hr
      • 27.2 cents * 770 hours / m = $210 / m 
    • Storage Multi-AZ General Purpose SSD @ 23 cents / GB
      • 5 GB / day * 30 days = $150 / m
  • RDS Transfer = $2.7 / m
    • Incoming data is free
    • OUT to internet = 9 cents / GB for 10 TB / m
      • 4m * (44 bytes + 120 bytes + 110 bytes) 9 cent / GB  * 30 days = $2.7 / m
  • Application Load Balancer = $24 / m
    • 2.25 cents / hour
    • 0.8 cents / hour for Load Balancer Capacity Units (bandwidth)
    • (2.25 + 0.8) * 770 hours = $24 / m
  • Route53 = $1 / m
    • 50 cents per hosted zone / m * 2 zones = $1 
  • CloudWatch Metrics = $6 / m
    • First 10k metrics @ 30 cents / metric 
      • 3 instances * 7 metrics * 30 cents = $6 / m
  • CloudFront / CDN (out of scope)
    • Depends on traffic configurations between CDN, Route53, LB, and regions
  • Other
    • Image to Text Algorithm
      • $100,000 over 6 months
    • Vendor (Quora like) API
      • Referring to AWS REST API costs of $3.50 / 300 million requests / M = $3.50
      • + Commercial Contract 
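The recurring S3 storage line items above can be reproduced with a few lines; the rates are the us-east-1 prices quoted in the bullets, and small differences are rounding.

# Reproduce the monthly S3 storage line items from the daily ingest estimates.
STANDARD_RATE = 0.023      # $/GB-month, first 50 TB
IA_RATE = 0.0125           # $/GB-month, S3 Infrequent Access (cross-region replica)

def monthly_storage_cost(gb_per_day: float):
    gb_per_month = gb_per_day * 30
    return gb_per_month * STANDARD_RATE, gb_per_month * IA_RATE

for label, gb_per_day in (("images", 133), ("text", 1.43)):
    std, ia = monthly_storage_cost(gb_per_day)
    print(f"{label}: ${std:.2f} standard + ${ia:.2f} IA replica = ${std + ia:.2f} / month")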

AWS Prototype Set-up Instructions

Step 1) Create S3 bucket: ny-s3fs-bucket

Step 2) Create Role: S3FS-Role with the following policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::ny-s3fs-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::ny-s3fs-bucket/*"
            ]
        }
    ]
}

Step 3) Create User access Key Pair & download, or use existing

Step 4) Create EC2 free tier instance: webserver2

  • Create or use a security group with 
    • HTTP listening to all 0.0.0.0/0
    • SSH listening to MyIP
  • Set Role to S3FS-Role
  • Note: Step 6 installations can be automated here

Step 5) Access EC2 via SSH

  • Generate public key & use as identity

ssh-keygen -y -f MyKP.pem > MyKPub

ren MyKP.pem MyKP

(Note: ren is the Windows rename command; on Linux/macOS use mv MyKP.pem MyKP)

  • Set user: ec2-user
  • Set host / EC2 Public IPv4: 18.205.29.255
  • Take note of private IP once logged in
    • [ec2-user@ip-172-31-87-215 ~]$

Step 6) Run the following commands

Note: Step 6 can be automated in Step 4 if you expect to do this multiple times (Auto Scaling)

#!/bin/bash
sudo yum update -y
sudo amazon-linux-extras install epel -y
sudo yum install s3fs-fuse -y
sudo yum install python3 -y
sudo yum install git -y
# Python packages (either the --user installs or the sudo installs are needed, not both)
pip3 install flask --user
sudo pip3 install flask
pip3 install mysql-connector-python --user
sudo pip3 install mysql-connector-python
sudo pip3 install ec2-metadata

Step 7) Set up credentials for s3fs

echo <access key>:<secret key> > ${HOME}/.passwd-s3fs

echo AKIAYDUNYYT75SNXNE:cR+D40HVgkDxJYP8f1JZogG7MUV7t6Ln5idFn > ${HOME}/.passwd-s3fs

chmod 600 ${HOME}/.passwd-s3fs

Step 8) Set up S3FS mount

mkdir mountstorage

s3fs ny-s3fs-bucket /home/ec2-user/mountstorage -o passwd_file=${HOME}/.passwd-s3fs

Step 9) Create virtual environment if needed (optional but recommended)

  • Skip

Step 10) Create Github directory

mkdir github
cd github

Step 11) Pull code

Step 12) Run code

sudo python3 mvpservice.py

Step 13) Set Up DB

  • Add a config.ini file
  • Add a python_mysql_dbconfig file to parse config.ini and import it into the main service (sketched below)
  • Update the DB security group to include the internal/private EC2 IP
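A sketch of the config.ini / python_mysql_dbconfig pattern referenced in Step 13 and in the Security section; the section and key names here are assumptions, not necessarily the ones used in the repository.

# config.ini (kept out of version control, readable only by the app user)
# [mysql]
# host     = <rds endpoint>
# database = advice
# user     = app
# password = <db password>

# python_mysql_dbconfig.py -- parse config.ini so credentials stay out of the main code
from configparser import ConfigParser

def read_db_config(filename: str = "config.ini", section: str = "mysql") -> dict:
    parser = ConfigParser()
    parser.read(filename)
    if not parser.has_section(section):
        raise Exception(f"Section [{section}] not found in {filename}")
    return dict(parser.items(section))

# In mvpservice.py:
# import mysql.connector
# conn = mysql.connector.connect(**read_db_config())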