Scrapy Telegram group chat

2020 April 03

AR

If you copy a dict that has a list as a value, the list itself does not get copied.

Anil Kumar in Scrapy

Once I download any file using Scrapy, I'll be populated with files in the log. How can I store the file information (checksum, path, URL) in the database?

14:05

Andrey Rahmatullin in Scrapy

All it does is:
for k, v in dict(*args, **kwargs).items():
    self[k] = v

14:05

Anil Kumar in Scrapy

Anil Kumar

Once I download any file using Scrapy, I'll be populated with files in the log. How can I store the file information (checksum, path, URL) in the database?

Anyone, please reply, it's urgent.

14:06

Михаил Синегубов in Scrapy

Andrey Rahmatullin

All it does is:
for k, v in dict(*args, **kwargs).items():
    self[k] = v

Uhh, so, in plain terms, if my dict or item is

{'cat': ['item1', 'item2']}

then I'm out of luck? I mean, copying won't save me?

14:07

Andrey Rahmatullin in Scrapy

There is no copying there.

14:07

Andrey Rahmatullin in Scrapy

There is only what I wrote above.

14:07

Andrey Rahmatullin in Scrapy

In Python terms, this is a copy, not a deepcopy.
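
The difference between a copy and a deepcopy can be illustrated with the dict from the question above. This is a minimal sketch in plain Python; it uses `dict(...)` as a stand-in for the key/value loop quoted earlier, which is an assumption about equivalent behavior rather than Scrapy's actual code.

```python
import copy

original = {'cat': ['item1', 'item2']}

# Shallow copy: a new dict object, but the list value is shared.
shallow = dict(original)
shallow['cat'].append('item3')
# The append is visible through the original, because both dicts
# point at the same list object.
shared = shallow['cat'] is original['cat']

# Deep copy: nested containers are copied recursively.
deep = copy.deepcopy(original)
deep['cat'].append('item4')
# The original list is not affected by changes to the deep copy.
print(shared, original['cat'], deep['cat'])
```

If the nested lists must stay independent, `copy.deepcopy` is the usual tool; the trade-off is that it walks the whole structure and is slower for large items.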

14:07

Михаил Синегубов in Scrapy

Andrey Rahmatullin

In Python terms, this is a copy, not a deepcopy.

OK, got it. The main thing I understood is that I did it wrong 🙃

14:08

Anil Kumar in Scrapy

Anil Kumar

Once I download any file using Scrapy, I'll be populated with files in the log. How can I store the file information (checksum, path, URL) in the database?

Anyone, please reply.

14:10

ildar in Scrapy

Anil Kumar

Once I download any file using Scrapy, I'll be populated with files in the log. How can I store the file information (checksum, path, URL) in the database?

What do you mean by "populated with files in the log"? I did not understand your question.

14:25

ildar in Scrapy

By the way, I ran into the deepcopy issue recently too.

14:26

Anil Kumar in Scrapy

ildar

What do you mean by "populated with files in the log"? I did not understand your question.

Once a file is downloaded, the Scrapy "files" field shows up in the console with info like checksum, path, and URL. I want to store this files field in the DB.

14:27

ildar in Scrapy

I'd recommend saving files under names derived from their SHA-256 hash. When there are many files, opening a single directory becomes slow, so keep them in 3 levels of subfolders, each named after the next 2 characters of the hash (6 characters in total). Like /aa/bb/23/aabb234rest_of_sha.jpg
That way you only need to know a file's hash to find it. So you have to rewrite your file-downloader middleware to store files like this, and have it save the checksum, file URL, and other meta info to the database. For easy work with the database, try this library: https://dataset.readthedocs.io/en/latest/
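
The directory layout described above could be sketched roughly as follows. The function name `sharded_path`, the `files` root directory, and the `.jpg` default extension are hypothetical and chosen only for illustration; only `hashlib` and `pathlib` from the standard library are used.

```python
import hashlib
from pathlib import Path


def sharded_path(content: bytes, root: str = "files", ext: str = ".jpg") -> Path:
    """Build a path like root/aa/bb/23/<full sha256><ext> from file content."""
    digest = hashlib.sha256(content).hexdigest()
    # Three subfolder levels, two hex characters each (six in total).
    return Path(root) / digest[0:2] / digest[2:4] / digest[4:6] / (digest + ext)


example = sharded_path(b"example file content")
print(example)
```

Before writing, the parent folders would need to exist, for example via `example.parent.mkdir(parents=True, exist_ok=True)`.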

14:30

ildar in Scrapy

If you already have the file in the directory with the same URL, you do nothing. In most cases you only have to check whether the URL is already in the database. If the URL is different but the SHA is the same, you can add it to the URL fields for that hash; it's up to you. If you have fewer than a million images, you can consider using MD5 to save resources.
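
One possible reading of this deduplication logic, sketched with the standard-library `sqlite3` module instead of the `dataset` library mentioned earlier. The table name `file_urls` and the helper names are assumptions made for this example.

```python
import hashlib
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the hypothetical file_urls table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_urls "
        "(url TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    return conn


def url_known(conn: sqlite3.Connection, url: str) -> bool:
    """Cheap check to run before downloading anything."""
    row = conn.execute("SELECT 1 FROM file_urls WHERE url = ?", (url,)).fetchone()
    return row is not None


def record_download(conn: sqlite3.Connection, url: str, content: bytes):
    """Record url -> hash; return (hash, True if the bytes are new on disk)."""
    sha = hashlib.sha256(content).hexdigest()
    seen = conn.execute(
        "SELECT 1 FROM file_urls WHERE sha256 = ?", (sha,)
    ).fetchone() is not None
    # A different URL with the same hash is stored as one more row.
    conn.execute(
        "INSERT OR IGNORE INTO file_urls (url, sha256) VALUES (?, ?)", (url, sha)
    )
    conn.commit()
    return sha, not seen
```

The second return value tells the caller whether the file still has to be written to disk, or whether only the new URL needed recording.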

14:37

Anil Kumar in Scrapy

ildar

If you already have the file in the directory with the same URL, you do nothing. In most cases you only have to check whether the URL is already in the database. If the URL is different but the SHA is the same, you can add it to the URL fields for that hash; it's up to you. If you have fewer than a million images, you can consider using MD5 to save resources.

Instead of doing all this, I can modify the media_downloaded method in the Files pipeline.

14:45

ildar in Scrapy

It's up to you where you modify the code and add the database handling.
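
As one option among several, the "files" field could be written to a database from a separate item pipeline that runs after the Files pipeline, rather than by overriding media_downloaded as suggested above. The sketch below is an assumption-heavy illustration: the class name, the `files.db` path, and the table layout are hypothetical, and it assumes each entry in `item["files"]` is a dict with `url`, `path`, and `checksum` keys.

```python
import sqlite3


class FilesToSqlitePipeline:
    """Hypothetical item pipeline that stores item["files"] entries in SQLite."""

    def __init__(self, db_path: str = "files.db"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS files "
            "(url TEXT PRIMARY KEY, path TEXT, checksum TEXT)"
        )

    def process_item(self, item, spider):
        # Assumes the files field was already filled in by the Files pipeline.
        for entry in item.get("files", []):
            self.conn.execute(
                "INSERT OR REPLACE INTO files (url, path, checksum) VALUES (?, ?, ?)",
                (entry["url"], entry["path"], entry["checksum"]),
            )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

For this to see the populated field, the pipeline would need a higher order number than the Files pipeline in the ITEM_PIPELINES setting.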

14:46

ildar in Scrapy

It already uses a SHA-1 hash for storage. It's up to you which hash you choose and how you organize your folders.
https://docs.scrapy.org/en/latest/topics/media-pipeline.html

14:50

ildar in Scrapy

Speaking of storage, has anyone used something like mountable files for storing the cache, for example? I recently downloaded a million URLs and deleting the cache afterwards was a pain 😞 And LevelDB is gone now. I'm thinking of trying this thing: https://pismotec.com/pfm/

14:55

Михаил Синегубов in Scrapy

ildar

Speaking of storage, has anyone used something like mountable files for storing the cache, for example? I recently downloaded a million URLs and deleting the cache afterwards was a pain 😞 And LevelDB is gone now. I'm thinking of trying this thing: https://pismotec.com/pfm/

What about using some kind of database?
But that's just me thinking out loud.

15:04