File upload (notes)

Guide to file uploads

The Old Way

Back then, file uploads were simple: files went straight to your server. That simplicity caused problems:

  • Server gets overwhelmed

  • Security risks

  • High bandwidth costs

  • Timeouts with big files

Pre-signed URLs: A Better Way

Instead of uploading directly to your server, we now use pre-signed URLs. Here's how it works:

  1. Client asks your server for permission to upload

  2. Server generates a temporary URL from S3

  3. Client uploads directly to S3

  4. S3 returns a response (HTTP 200) to the client when upload is successful

  5. Client tells your server it's done

  6. Server gets the permanent URL from S3

  7. Stores the URL in the database (or wherever you need it)

Pre-signed URLs are safer because they expire quickly and have specific permissions. Plus, your server doesn't touch the actual files.

Handling Big Files with Chunks

Big files need special handling. We split them into smaller pieces:

  1. Break the file into chunks (e.g., 5 MB each; S3 requires at least 5 MB per part, except for the last)

  2. Create a fingerprint for each chunk

  3. Track each chunk's status

Fingerprinting is key here. It creates a unique ID for each chunk based on its content. This helps with:

  • Finding duplicate chunks

  • Checking if chunks are corrupt

  • Picking up where you left off if upload fails

Verifying Uploads

We have two ways to check if chunks actually made it to S3:

  1. Client tells us (Client-driven):

    • Client uploads to S3

    • S3 confirms to client

    • Client tells our server

    • Server double-checks with S3

  2. S3 tells us directly (Storage-driven):

    • Client uploads to S3

    • S3 automatically tells our server

    • Server updates its records

The second way (S3 events) is better because:

  • More reliable

  • Faster

  • Simpler code

  • Can't be spoofed by the client, since the event comes from S3 itself

S3 Multipart Uploads

S3 has this chunking system built right in. Its multipart upload:

  • Handles chunks for you

  • Puts files back together

  • Checks everything worked

  • Lets you pause and resume

  • Uses network better

How S3 multipart upload works (full flow explained)

I realized my notes were quite ambiguous and wanted to extend the post with a separate section.

Let me break down the flow step by step:

  1. Initial Request from client to server

    • Client sends: file metadata (name, size, total chunks)

    • Server creates records in database:

      • New row in files table with status="pending"

      • Creates file_chunks entries for each expected chunk

    • Returns file_id to client

  2. Pre-signed URLs

    • Client requests pre-signed URLs from the server for the chunks it's ready to upload

    • Server generates URLs (via S3) with specific paths: uploads/{file_id}/chunks/{chunk_number}

    • These paths aren't random; the structure is deliberate, so the key alone tells us which chunk belongs to which file.
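The key convention above can be sketched as a pair of helpers (the names are illustrative):

```javascript
// The object key itself encodes which file and which chunk, so the server
// can recover both from an S3 event without any extra lookup.
function buildChunkKey(fileId, chunkNumber) {
  return `uploads/${fileId}/chunks/${chunkNumber}`;
}

function parseChunkKey(key) {
  const match = key.match(/^uploads\/([^/]+)\/chunks\/(\d+)$/);
  if (!match) return null; // not a chunk object
  return { fileId: match[1], chunkNumber: Number(match[2]) };
}
```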

  3. Chunk Upload Flow

    • Client now has the pre-signed URLs it requested

    • Client uploads chunk to S3 using pre-signed URL

    • S3 responds to client with success + ETag

    • S3 (S3 Event Notifications) sends event to SQS/SNS with:

    {
      "eventName": "ObjectCreated:Put",
      "s3": {
        "object": {
          "key": "uploads/file_123/chunks/1",
          "eTag": "..."
        }
      }
    }
    • Server processes event:

      • Parses file_id and chunk_number from the S3 key

      • Updates file_chunks status to "uploaded"

      • Updates files status if all chunks are uploaded
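The event-processing step can be sketched like this. In-memory maps again stand in for the database tables, and the handler name is made up:

```javascript
// Sketch: handle an ObjectCreated event from S3 (delivered via SQS/SNS).
const files = new Map();      // file_id -> { status, totalChunks }
const fileChunks = new Map(); // file_id -> Map(chunkNumber -> { status, etag })

function handleS3Event(event) {
  const { key, eTag } = event.s3.object;
  // Recover file_id and chunk_number from the key convention.
  const match = key.match(/^uploads\/([^/]+)\/chunks\/(\d+)$/);
  if (!match) return; // not a chunk upload, ignore
  const fileId = match[1];
  const chunkNumber = Number(match[2]);

  // Record the chunk as uploaded, keeping the ETag for the completion call.
  const chunks = fileChunks.get(fileId);
  chunks.set(chunkNumber, { status: "uploaded", etag: eTag });

  // Once every expected chunk has arrived, promote the file itself.
  const file = files.get(fileId);
  const uploaded = [...chunks.values()].filter(c => c.status === "uploaded");
  if (uploaded.length === file.totalChunks) file.status = "uploaded";
}
```

Storing the ETag here is important: the completion request in step 4 needs it for every part.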

  4. Completion

    • Client sends completion request to the server when all chunks uploaded

    • Server:

      • Verifies all chunks are present in DB

      • Initiates S3 multipart completion request

      • Updates file status to "completed"

Completion request

Each chunk's ETag and part number are needed for the S3 CompleteMultipartUpload API call. S3 needs:

  1. Each part's ETag (which S3 generated when each chunk was uploaded)

  2. The correct part number for each chunk

  3. All parts must be in order

Example of what S3 requires for completion:

await s3.completeMultipartUpload({
  Bucket: "my-bucket",
  Key: "final/file/path",
  UploadId: "...", // from when the multipart upload was initiated
  MultipartUpload: {
    Parts: [
      // Parts must be listed in ascending PartNumber order,
      // each with the ETag S3 returned when that chunk was uploaded.
      { PartNumber: 1, ETag: "etag1..." },
      { PartNumber: 2, ETag: "etag2..." },
      { PartNumber: 3, ETag: "etag3..." }
    ]
  }
}).promise();

S3 uses these ETags to:

  • Verify all chunks were uploaded successfully

  • Verify chunks weren't corrupted

  • Assemble the final file in correct order

This is why we store ETags in our database as each chunk is uploaded. We need them in the end for this final step.
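Building that Parts array from the stored chunk records is a one-liner worth getting right, since S3 rejects out-of-order parts. A sketch (the row shape matches the earlier examples and is an assumption, not a fixed schema):

```javascript
// Turn stored (chunkNumber, etag) rows into the Parts array that
// CompleteMultipartUpload expects: ascending part numbers, one ETag each.
function buildPartsList(chunkRows) {
  return chunkRows
    .filter(c => c.status === "uploaded")
    .sort((a, b) => a.chunkNumber - b.chunkNumber)
    .map(c => ({ PartNumber: c.chunkNumber, ETag: c.etag }));
}
```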