Once I download a file using Scrapy, I'll be populated with files in the log. How can I store the file information (checksum, path, URL) in the database?
What do you mean by "populated with files in the log"? I didn't understand your question.
I'd recommend naming the files by their SHA-256 hash. When there are many files in one directory, opening it gets slow, so spread them across 3 levels of subfolders named after the first characters of the hash, 2 characters per level (6 characters in total). Like /aa/bb/23/aabb234rest_of_sha.jpg. That way you only need to know a file's hash to find it. So you'd have to rewrite your file-downloader middleware to store files like this, and have it save the checksum, file URL, and other meta info to the database. For easy work with the database, try this library: https://dataset.readthedocs.io/en/latest/
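To make the folder layout concrete, here is a minimal sketch using only the standard library. The names `sharded_path` and `store` and the `ext` parameter are made up for illustration; they are not part of Scrapy or of the dataset library.

```python
import hashlib
from pathlib import Path


def sharded_path(root, content, ext=""):
    """Build root/aa/bb/cc/<full sha256><ext> from the file content."""
    digest = hashlib.sha256(content).hexdigest()
    # Three levels of subfolders, two hex characters each (six in total).
    return Path(root) / digest[0:2] / digest[2:4] / digest[4:6] / f"{digest}{ext}"


def store(root, content, ext=""):
    """Write content to its sharded location unless it is already there."""
    path = sharded_path(root, content, ext)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_bytes(content)
    return path
```

Since the path is derived from the hash alone, identical content always lands in the same place, so storing the same bytes twice is a no-op.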
If the file is already in the directory and the URL is the same, you do nothing. In most cases you only have to check whether the URL is already in the database. If the URL is different but the SHA is the same, you can add it to the URL field for that hash; it's up to you. If you have fewer than a million images, you could consider using MD5 to save resources.
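Roughly, that check could look like the sketch below. It uses the standard-library `sqlite3` module rather than the dataset library mentioned earlier, and the table layout plus the `open_db` and `register` names are assumptions made up for this example.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a small table mapping each URL to a checksum and path."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "url TEXT PRIMARY KEY, checksum TEXT NOT NULL, path TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_checksum ON files (checksum)")
    return conn


def register(conn, url, checksum, path):
    """Record a download and report which of the three cases applied."""
    # Case 1: the URL is already known, so there is nothing to do.
    if conn.execute("SELECT 1 FROM files WHERE url = ?", (url,)).fetchone():
        return "known_url"
    # Case 2 or 3: new URL; check whether the same content was seen before.
    known = conn.execute(
        "SELECT 1 FROM files WHERE checksum = ?", (checksum,)
    ).fetchone()
    conn.execute(
        "INSERT INTO files (url, checksum, path) VALUES (?, ?, ?)",
        (url, checksum, path),
    )
    conn.commit()
    return "new_url_for_known_file" if known else "new_file"
```

Here every URL gets its own row, so several URLs can point at the same checksum and path; whether to keep those extra URLs at all is the "up to you" part.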
Instead of doing all this, I could override the media_downloaded method in the Files pipeline.
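Another option that avoids touching `media_downloaded` at all: once the Files pipeline has processed an item, it stores a list of dicts with `url`, `path` and `checksum` in the item's `files` field (the default result field), so a second, ordinary item pipeline placed after it can write those to a database. The sketch below assumes that default field name, uses `sqlite3` as a stand-in database, and the class name, table name and `files.db` file name are made up.

```python
import sqlite3


class FileInfoToDbPipeline:
    """Hypothetical pipeline: saves what FilesPipeline put into item['files']."""

    def __init__(self, db_path="files.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider=None):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS downloaded_files ("
            "url TEXT PRIMARY KEY, path TEXT, checksum TEXT)"
        )

    def close_spider(self, spider=None):
        if self.conn is not None:
            self.conn.commit()
            self.conn.close()

    def process_item(self, item, spider=None):
        # Each entry is a dict with at least 'url', 'path' and 'checksum'.
        for info in item.get("files", []):
            self.conn.execute(
                "INSERT OR REPLACE INTO downloaded_files (url, path, checksum) "
                "VALUES (?, ?, ?)",
                (info["url"], info["path"], info["checksum"]),
            )
        self.conn.commit()
        return item
```

To use it, it would need a higher number than `scrapy.pipelines.files.FilesPipeline` in `ITEM_PIPELINES` so that it runs afterwards. Note that the checksum Scrapy reports there is an MD5 of the file content.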
Speaking of storage, has anyone used something like mountable files for storing a cache, for example? I downloaded a million URLs recently and got sick of deleting it all afterwards 😞 And LevelDB is gone now. I'm thinking of trying this thing: https://pismotec.com/pfm/
What about using some kind of database? But that's just me thinking out loud.