Guide to file uploads
The Old Way
Back then, file uploads were simple. Files went straight to your server. This caused problems:
Server gets overwhelmed
Security risks
High bandwidth costs
Timeouts with big files
Pre-signed URLs: A Better Way
Instead of uploading directly to your server, we now use pre-signed URLs. Here's how it works:
Client asks your server for permission to upload
Server generates a temporary pre-signed URL for S3 (signed locally with its AWS credentials)
Client uploads directly to S3
S3 returns a response (HTTP 200) to the client when upload is successful
Client tells your server it's done
Server gets the permanent URL from S3
Server stores the URL in the database (or wherever you need it)
Pre-signed URLs are safer because they expire quickly and have specific permissions. Plus, your server doesn't touch the actual files.
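To make "expire quickly and have specific permissions" concrete, here is a toy sketch of the idea behind a signed, expiring URL. It is not how S3 signs requests: real pre-signed URLs are produced by the AWS SDK using Signature Version 4. The secret, the storage.example host, and the function names signUrl and verifyUrl are all assumptions made up for illustration.

```javascript
const crypto = require("node:crypto");

// Hypothetical signing secret; real S3 uses your AWS credentials and SigV4.
const SECRET = "demo-secret";

function hmac(method, key, expiresAt) {
  return crypto
    .createHmac("sha256", SECRET)
    .update(`${method}\n${key}\n${expiresAt}`)
    .digest("hex");
}

// Issue a URL that allows one method on one key until expiresAt (epoch seconds).
function signUrl(method, key, expiresAt) {
  const sig = hmac(method, key, expiresAt);
  return `https://storage.example/${key}?expires=${expiresAt}&sig=${sig}`;
}

// What the storage side would check before accepting the request.
function verifyUrl(method, url, now) {
  const u = new URL(url);
  const expiresAt = Number(u.searchParams.get("expires"));
  if (now > expiresAt) return false; // expired
  const given = Buffer.from(u.searchParams.get("sig") || "", "hex");
  const expected = Buffer.from(hmac(method, u.pathname.slice(1), expiresAt), "hex");
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}

const url = signUrl("PUT", "uploads/file_123/chunks/1", 1000);
console.log(verifyUrl("PUT", url, 900));  // true: right method, not expired
console.log(verifyUrl("GET", url, 900));  // false: wrong method
console.log(verifyUrl("PUT", url, 1001)); // false: expired
```

Because the method, the key, and the expiry are all part of the signed payload, the client can't reuse the URL for a different object or a different action, and it stops working once the time is up.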
Handling Big Files with Chunks
Big files need special handling. We split them into smaller pieces:
Break the file into chunks (like 5MB each)
Create a fingerprint for each chunk
Track each chunk's status
Fingerprinting is key here. It creates a unique ID for each chunk based on its content. This helps with:
Finding duplicate chunks
Checking if chunks are corrupt
Picking up where you left off if upload fails
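A minimal sketch of the chunking and fingerprinting steps above, in Node.js. The function name chunkAndFingerprint, the record shape, and the choice of SHA-256 are assumptions for illustration; any content hash would serve the same purpose.

```javascript
const crypto = require("node:crypto");

// Assumed chunk size: 5 MiB, in line with the "like 5MB each" figure above.
const CHUNK_SIZE = 5 * 1024 * 1024;

// Split a Buffer into chunks and fingerprint each one with SHA-256.
// The record shape (number, size, fingerprint, status) is hypothetical.
function chunkAndFingerprint(buffer, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0; offset < buffer.length; offset += chunkSize) {
    const data = buffer.subarray(offset, offset + chunkSize);
    chunks.push({
      number: chunks.length + 1, // 1-based chunk number
      size: data.length,
      fingerprint: crypto.createHash("sha256").update(data).digest("hex"),
      status: "pending",
    });
  }
  return chunks;
}

// Example: a 12-byte buffer split into 5-byte chunks gives sizes 5, 5, 2.
const demo = chunkAndFingerprint(Buffer.from("aaaaabbbbbcc"), 5);
console.log(demo.map((c) => c.size)); // [ 5, 5, 2 ]
```

Two chunks with identical bytes get the same fingerprint, which is what makes duplicate detection possible. After a failed upload, the client can recompute fingerprints and skip any chunk the server already has marked as uploaded.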
Verifying Uploads
We have two ways to check if chunks actually made it to S3:
Client tells us (Client-driven):
Client uploads to S3
S3 confirms to client
Client tells our server
Server double-checks with S3
S3 tells us directly (Storage-driven):
Client uploads to S3
S3 automatically tells our server
Server updates its records
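The second way (S3 events) is better because:
More reliable: it doesn't depend on the client reporting back
Faster
Simpler code
Can't be faked: the notification comes from S3, not from the client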
S3 Multipart Uploads
S3 built this chunking system right in. Their multipart upload:
Handles chunks for you
Puts files back together
Checks everything worked
Lets you pause and resume
Makes better use of the network (parts can upload in parallel)
How S3 multipart upload works (full flow explained)
I realized my notes were quite ambiguous and wanted to extend the post with a separate section.
Let me break down the flow step by step:
Initial Request from client to server
Client sends: file metadata (name, size, total chunks)
Server creates records in database:
New row in files table with status="pending"
Creates file_chunks entries for each expected chunk
Returns file_id to client
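The initial request could be handled roughly like this. The two Maps are in-memory stand-ins for the files and file_chunks tables, and the function name createUpload, the id format, and the row shapes are assumptions, not a real schema.

```javascript
// In-memory stand-ins for the files and file_chunks tables (hypothetical).
const files = new Map();      // file_id -> { name, size, totalChunks, status }
const fileChunks = new Map(); // "file_id:chunk_number" -> { status, etag }

let nextId = 1;

// Handle the initial request: record the file and one row per expected chunk.
function createUpload({ name, size, totalChunks }) {
  const fileId = `file_${nextId++}`;
  files.set(fileId, { name, size, totalChunks, status: "pending" });
  for (let n = 1; n <= totalChunks; n++) {
    fileChunks.set(`${fileId}:${n}`, { status: "pending", etag: null });
  }
  return fileId; // sent back to the client
}

const fileId = createUpload({ name: "video.mp4", size: 12000000, totalChunks: 3 });
console.log(fileId, fileChunks.size); // file_1 3
```

Creating every chunk row up front means the server always knows how many chunks it is still waiting for.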
Pre-signed URLs
Client asks the server for pre-signed URLs for the chunks it's ready to upload
Server generates pre-signed URLs (signed locally with its AWS credentials, no request to S3 needed) with specific paths:
uploads/{file_id}/chunks/{chunk_number}
These paths aren’t random. They follow a fixed structure, so the key alone tells us which chunk belongs to which file.
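As a small sketch, the keys for a batch of requested chunks could be built like this before signing. The helper names chunkKey and keysForRequest are hypothetical, and the actual signing step (done by the AWS SDK) is left out.

```javascript
// Build the object key for one chunk, following the path layout above.
function chunkKey(fileId, chunkNumber) {
  if (!Number.isInteger(chunkNumber) || chunkNumber < 1) {
    throw new RangeError(`invalid chunk number: ${chunkNumber}`);
  }
  return `uploads/${fileId}/chunks/${chunkNumber}`;
}

// Map the chunk numbers a client asked for to the keys that would be signed.
function keysForRequest(fileId, chunkNumbers) {
  return chunkNumbers.map((n) => ({ chunkNumber: n, key: chunkKey(fileId, n) }));
}

console.log(keysForRequest("file_123", [1, 2]));
```

Rejecting bad chunk numbers here keeps malformed keys from ever being signed.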
Chunk Upload Flow
Client now has the pre-signed URLs it requested
Client uploads chunk to S3 using pre-signed URL
S3 responds to client with success + ETag
S3 (S3 Event Notifications) sends event to SQS/SNS with:
{
  "eventName": "ObjectCreated:Put",
  "s3": {
    "object": {
      "key": "uploads/file_123/chunks/1",
      "eTag": "..."
    }
  }
}
Server processes event:
Parses file_id and chunk_number from the S3 key
Updates file_chunks status to "uploaded"
Updates files status if all chunks are uploaded
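The event-processing steps above might look something like this. The Maps again stand in for the two tables, handleObjectCreated is a hypothetical name, and the "uploaded" status for the file is an assumption since the post doesn't name it. The handler is written so that processing the same event twice is harmless, which is a sensible precaution when events arrive through a queue.

```javascript
// In-memory stand-ins for the two tables (hypothetical shapes).
const files = new Map([["file_123", { totalChunks: 2, status: "pending" }]]);
const fileChunks = new Map([
  ["file_123:1", { status: "pending", etag: null }],
  ["file_123:2", { status: "pending", etag: null }],
]);

const KEY_PATTERN = /^uploads\/([^/]+)\/chunks\/(\d+)$/;

// Process one event shaped like the example above.
function handleObjectCreated(event) {
  const { key, eTag } = event.s3.object;
  const match = KEY_PATTERN.exec(key);
  if (!match) return false; // not a chunk upload, ignore
  const [, fileId, chunkNumber] = match;

  const chunk = fileChunks.get(`${fileId}:${chunkNumber}`);
  if (!chunk) return false; // chunk we never expected
  chunk.status = "uploaded";
  chunk.etag = eTag; // kept for the completion step

  // Mark the file once every expected chunk has arrived.
  const file = files.get(fileId);
  let uploaded = 0;
  for (let n = 1; n <= file.totalChunks; n++) {
    if (fileChunks.get(`${fileId}:${n}`).status === "uploaded") uploaded++;
  }
  if (uploaded === file.totalChunks) file.status = "uploaded";
  return true;
}

const event = (n) => ({
  eventName: "ObjectCreated:Put",
  s3: { object: { key: `uploads/file_123/chunks/${n}`, eTag: `etag${n}` } },
});
handleObjectCreated(event(1));
console.log(files.get("file_123").status); // pending
handleObjectCreated(event(2));
console.log(files.get("file_123").status); // uploaded
```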
Completion
Client sends completion request to the server when all chunks are uploaded
Server:
Verifies all chunks are present in DB
Initiates S3 multipart completion request
Updates file status to "completed"
Completion request
The S3 CompleteMultipartUpload API call needs the ETag and the part number of every chunk. S3 needs:
Each part's ETag (which S3 returned when that part was uploaded)
The correct part number for each chunk
The parts listed in ascending part-number order
Example of what S3 requires for completion:
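const AWS = require("aws-sdk"); // AWS SDK for JavaScript v2
const s3 = new AWS.S3();

async function completeUpload() {
  // Parts must be listed in ascending PartNumber order.
  await s3.completeMultipartUpload({
    Bucket: "my-bucket",
    Key: "final/file/path",
    UploadId: "...", // from when multipart upload was initiated
    MultipartUpload: {
      Parts: [
        { PartNumber: 1, ETag: "etag1..." },
        { PartNumber: 2, ETag: "etag2..." },
        { PartNumber: 3, ETag: "etag3..." }
      ]
    }
  }).promise();
}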
S3 uses these ETags to:
Verify all chunks were uploaded successfully
Verify chunks weren't corrupted
Assemble the final file in correct order
This is why we store ETags in our database as each chunk is uploaded. We need them in the end for this final step.
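As a sketch of that last step, the stored rows can be turned into the Parts array like this. The row shape ({ chunkNumber, etag, status }) and the function name buildParts are assumptions standing in for whatever file_chunks actually looks like.

```javascript
// Turn stored chunk rows into the Parts array for CompleteMultipartUpload.
// The row shape is a hypothetical stand-in for the file_chunks table.
function buildParts(rows, totalChunks) {
  const uploaded = rows.filter((r) => r.status === "uploaded" && r.etag);
  if (uploaded.length !== totalChunks) {
    throw new Error(`expected ${totalChunks} uploaded chunks, found ${uploaded.length}`);
  }
  return uploaded
    .sort((a, b) => a.chunkNumber - b.chunkNumber) // ascending part numbers
    .map((r) => ({ PartNumber: r.chunkNumber, ETag: r.etag }));
}

// Rows may come back from the database in any order.
const parts = buildParts(
  [
    { chunkNumber: 2, etag: "etag2", status: "uploaded" },
    { chunkNumber: 1, etag: "etag1", status: "uploaded" },
  ],
  2
);
console.log(parts.map((p) => p.PartNumber)); // [ 1, 2 ]
```

Failing early when a chunk is missing avoids sending S3 a completion request that can't succeed.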