Downloading large files from the internet can be time-consuming and error-prone. One efficient technique is to download the file in smaller parts (chunks) and merge them after completion. In this guide, we'll show you how to automate and accelerate chunked downloads using curl with parallel threads in Python.
Why Parallel Chunk Downloads?
- Faster downloads using multiple threads
- More stable over poor connections
- Finer control over large transfers, since each chunk can be retried independently
Requirements
- Python 3.x (the script also uses the requests library)
- curl installed on your system
- A server that supports HTTP Range requests (a quick way to check this is sketched below)
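Before running the script, you can confirm that the server advertises byte-range support. The snippet below is a minimal check using requests; the URL is a placeholder, and note that some servers honor Range requests without sending an Accept-Ranges header, so treat this as a hint rather than a guarantee.

import requests

def supports_range(url):
    # Servers that allow partial downloads usually advertise "Accept-Ranges: bytes"
    response = requests.head(url, allow_redirects=True)
    return response.headers.get("Accept-Ranges", "").lower() == "bytes"

# Placeholder URL for illustration only
print(supports_range("https://example.com/large-file.iso"))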
Python Script for Parallel Download
Save the following code as parallel_chunk_download.py:
import os
import math
import threading
import subprocess
import requests


def get_file_size(url):
    # Ask the server for the file size via a HEAD request
    response = requests.head(url, allow_redirects=True)
    if 'Content-Length' in response.headers:
        return int(response.headers['Content-Length'])
    else:
        raise Exception("Cannot determine file size. Server does not return 'Content-Length'.")


def download_chunk(url, start, end, part_num):
    # Download one byte range with curl and save it as a numbered part file
    filename = f"part{part_num:03d}.chunk"
    cmd = ["curl", "-s", "-r", f"{start}-{end}", "-o", filename, url]
    subprocess.run(cmd, check=True)


def merge_chunks(total_parts, output_file):
    # Concatenate the part files in order, deleting each one after it is written
    with open(output_file, "wb") as out:
        for i in range(total_parts):
            part = f"part{i:03d}.chunk"
            with open(part, "rb") as pf:
                out.write(pf.read())
            os.remove(part)


def main():
    url = input("Enter file URL: ").strip()
    output_file = input("Enter output filename: ").strip()
    chunk_size = 100 * 1024 * 1024  # 100 MB

    total_size = get_file_size(url)
    total_parts = math.ceil(total_size / chunk_size)

    print(f"Total size: {total_size} bytes")
    print(f"Starting parallel download in {total_parts} chunks...")

    # One thread per chunk; each thread runs its own curl process
    threads = []
    for i in range(total_parts):
        start = i * chunk_size
        end = min(start + chunk_size - 1, total_size - 1)
        t = threading.Thread(target=download_chunk, args=(url, start, end, i))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

    print("Merging chunks...")
    merge_chunks(total_parts, output_file)
    print(f"Download complete: {output_file}")


if __name__ == "__main__":
    main()
How It Works
- The script uses requests to find the total file size
- Divides the file into 100 MB chunks (the byte-range arithmetic is illustrated after this list)
- Spawns a thread for each chunk, each using curl with a specific byte range
- Merges all parts after download
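To illustrate the byte-range arithmetic, here is how a hypothetical 250 MB file would be split with the script's 100 MB chunk_size (the file size is made up for the example):

import math

total_size = 250 * 1024 * 1024   # hypothetical file size for this example
chunk_size = 100 * 1024 * 1024   # same 100 MB chunks as the script

for i in range(math.ceil(total_size / chunk_size)):
    start = i * chunk_size
    end = min(start + chunk_size - 1, total_size - 1)
    print(f"part{i:03d}: curl -r {start}-{end}")

This prints three ranges: two full 100 MB chunks and a final 50 MB chunk ending at the last byte of the file.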
Tips
- Adjust chunk_size for optimal performance
- If the Python threads themselves become a bottleneck, consider multiprocessing instead of threading (here each thread mostly waits on its own curl process, so threads are usually sufficient)
- For unstable connections, ensure partial downloads are re-attempted (a retry sketch follows this list)
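One way to implement re-attempts is to wrap the chunk download in a small retry loop. This is a minimal sketch rather than part of the original script: the function name, retry count, and backoff delay are all assumptions, and the -f flag simply makes curl exit with an error on HTTP failures so the retry actually triggers.

import os
import subprocess
import time

def download_chunk_with_retries(url, start, end, part_num, retries=3):
    # Hypothetical retry wrapper around the script's curl-based chunk download
    filename = f"part{part_num:03d}.chunk"
    for attempt in range(1, retries + 1):
        try:
            cmd = ["curl", "-s", "-f", "-r", f"{start}-{end}", "-o", filename, url]
            subprocess.run(cmd, check=True)
            return
        except subprocess.CalledProcessError:
            # Discard the incomplete chunk and try again from the start of the range
            if os.path.exists(filename):
                os.remove(filename)
            if attempt == retries:
                raise
            time.sleep(2 * attempt)  # simple linear backoff between attempts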
Conclusion
Using Python and curl together allows you to automate and optimize file downloads, especially when working with large files. Parallel chunk downloading is an efficient and scriptable way to speed up your workflow.