File upload (notes)

Guide to file uploads

The Old Way

Back then, file uploads were simple: files went straight to your server. That simplicity caused problems:

  • Server gets overwhelmed

  • Security risks

  • High bandwidth costs

  • Timeouts with big files

Pre-signed URLs: A Better Way

Instead of uploading directly to your server, we now use pre-signed URLs. Here's how it works:

  1. Client asks your server for permission to upload

  2. Server generates a temporary URL from S3

  3. Client uploads directly to S3

  4. S3 returns a response (HTTP 200) to the client when upload is successful

  5. Client tells your server it's done

  6. Server gets the permanent URL from S3

  7. Stores the URL in the database (or wherever you need it)

Pre-signed URLs are safer because they expire quickly and have specific permissions. Plus, your server doesn't touch the actual files.

Handling Big Files with Chunks

Big files need special handling. We split them into smaller pieces:

  1. Break the file into chunks (e.g., 5 MB each; S3 requires at least 5 MB per part, except for the last)

  2. Create a fingerprint for each chunk

  3. Track each chunk's status

Fingerprinting is key here. It creates a unique ID for each chunk based on its content. This helps with:

  • Finding duplicate chunks

  • Checking if chunks are corrupt

  • Picking up where you left off if upload fails

Verifying Uploads

We have two ways to check if chunks actually made it to S3:

  1. Client tells us (Client-driven):

    • Client uploads to S3

    • S3 confirms to client

    • Client tells our server

    • Server double-checks with S3

  2. S3 tells us directly (Storage-driven):

    • Client uploads to S3

    • S3 automatically tells our server

    • Server updates its records

The second way (S3 events) is better because:

  • More reliable

  • Faster

  • Simpler code

  • Can't be spoofed by the client, since the event comes from S3 itself

S3 Multipart Uploads

S3 has this chunking system built right in. Its multipart upload:

  • Handles chunks for you

  • Puts files back together

  • Checks everything worked

  • Lets you pause and resume

  • Uses network better

How S3 multipart upload works (full flow explained)

I realized my notes were quite ambiguous and wanted to extend the post with a separate section.

Let me break down the flow step by step:

  1. Initial Request from client to server

    • Client sends: file metadata (name, size, total chunks)

    • Server creates records in database:

      • New row in files table with status="pending"

      • Creates file_chunks entries for each expected chunk

    • Returns file_id to client

  2. Pre-signed URLs

    • Client requests pre-signed URLs from the server for the chunks it's ready to upload

    • Server generates URLs (via S3) with specific paths: uploads/{file_id}/chunks/{chunk_number}

    • These paths aren't random; the structure is deliberate, so the key alone tells us which chunk belongs to which file.
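The key convention above can be sketched as a pair of helpers (the names are illustrative):

```javascript
// The object key itself encodes which file and which chunk, so the server
// can recover both from an S3 event without any extra lookup.
function buildChunkKey(fileId, chunkNumber) {
  return `uploads/${fileId}/chunks/${chunkNumber}`;
}

function parseChunkKey(key) {
  const match = key.match(/^uploads\/([^/]+)\/chunks\/(\d+)$/);
  if (!match) return null; // not a chunk object
  return { fileId: match[1], chunkNumber: Number(match[2]) };
}
```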

  3. Chunk Upload Flow

    • Client now has the pre-signed URLs it requested

    • Client uploads chunk to S3 using pre-signed URL

    • S3 responds to client with success + ETag

    • S3 (S3 Event Notifications) sends event to SQS/SNS with:

    {
      "eventName": "ObjectCreated:Put",
      "s3": {
        "object": {
          "key": "uploads/file_123/chunks/1",
          "eTag": "..."
        }
      }
    }
    • Server processes event:

      • Parses file_id and chunk_number from the S3 key

      • Updates file_chunks status to "uploaded"

      • Updates files status if all chunks are uploaded
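The event-processing step can be sketched like this. In-memory maps again stand in for the database tables, and the handler name is made up:

```javascript
// Sketch: handle an ObjectCreated event from S3 (delivered via SQS/SNS).
const files = new Map();      // file_id -> { status, totalChunks }
const fileChunks = new Map(); // file_id -> Map(chunkNumber -> { status, etag })

function handleS3Event(event) {
  const { key, eTag } = event.s3.object;
  // Recover file_id and chunk_number from the key convention.
  const match = key.match(/^uploads\/([^/]+)\/chunks\/(\d+)$/);
  if (!match) return; // not a chunk upload, ignore
  const fileId = match[1];
  const chunkNumber = Number(match[2]);

  // Record the chunk as uploaded, keeping the ETag for the completion call.
  const chunks = fileChunks.get(fileId);
  chunks.set(chunkNumber, { status: "uploaded", etag: eTag });

  // Once every expected chunk has arrived, promote the file itself.
  const file = files.get(fileId);
  const uploaded = [...chunks.values()].filter(c => c.status === "uploaded");
  if (uploaded.length === file.totalChunks) file.status = "uploaded";
}
```

Storing the ETag here is important: the completion request in step 4 needs it for every part.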

  4. Completion

    • Client sends completion request to the server when all chunks uploaded

    • Server:

      • Verifies all chunks are present in DB

      • Initiates S3 multipart completion request

      • Updates file status to "completed"

Completion request

Each chunk's ETag and part number are needed for the S3 CompleteMultipartUpload API call. S3 needs:

  1. Each part's ETag (which S3 generated when each chunk was uploaded)

  2. The correct part number for each chunk

  3. All parts must be in order

Example of what S3 requires for completion:

await s3.completeMultipartUpload({
  Bucket: "my-bucket",
  Key: "final/file/path",
  UploadId: "...", // from when the multipart upload was initiated
  MultipartUpload: {
    Parts: [
      // Parts must be listed in ascending PartNumber order,
      // each with the ETag S3 returned when that chunk was uploaded.
      { PartNumber: 1, ETag: "etag1..." },
      { PartNumber: 2, ETag: "etag2..." },
      { PartNumber: 3, ETag: "etag3..." }
    ]
  }
}).promise();

S3 uses these ETags to:

  • Verify all chunks were uploaded successfully

  • Verify chunks weren't corrupted

  • Assemble the final file in correct order

This is why we store ETags in our database as each chunk is uploaded. We need them in the end for this final step.
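Building that Parts array from the stored chunk records is a one-liner worth getting right, since S3 rejects out-of-order parts. A sketch (the row shape matches the earlier examples and is an assumption, not a fixed schema):

```javascript
// Turn stored (chunkNumber, etag) rows into the Parts array that
// CompleteMultipartUpload expects: ascending part numbers, one ETag each.
function buildPartsList(chunkRows) {
  return chunkRows
    .filter(c => c.status === "uploaded")
    .sort((a, b) => a.chunkNumber - b.chunkNumber)
    .map(c => ({ PartNumber: c.chunkNumber, ETag: c.etag }));
}
```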