I have several questions about the second part of the Data Clustering Contest.
1. Is it OK that the test archive contains a few docs whose publish time is 3 days later than the date in the folder name? They are extremely bad for testing, and if programs are evaluated against them, relevance breaks completely.
2. Some docs in the folder are in a language that cannot be described by a two-letter ISO 639-1 language code. What should we do with them? Not show them, or default to English?
3. Which type of disk will the test server use: HDD, SSD, or NVMe? Which instruction sets will be supported? Is at least SSE4.2 guaranteed? Otherwise it is hard to provide a working binary without knowing which instructions are supported on x86-64.
4. In the server part: can reindexing a doc change its language? And isn't it a problem that reindexing can make the max document time go backwards?
5. Can we delay when an indexed doc appears in responses? I mean, if we have indexed a doc, may it show up only after some time? Is that allowed, or must we guarantee strong consistency?
6. Is it guaranteed that 16 GB will be enough for all docs to fit in memory while serving requests? With their text or without it? This is closely related to the 3rd question.
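
To make question 1 concrete, here is a rough sketch of how I would flag such outlier docs. It rests on assumptions that are mine, not from the contest rules: the folder name is a date in `YYYYMMDD` form, the publish time is already parsed into a timezone-aware `datetime`, and a drift of 3 days in either direction counts as an outlier. The names `folder_date`, `is_outlier`, and `MAX_DRIFT` are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

# Assumed threshold: how far a publish time may drift from the folder date.
MAX_DRIFT = timedelta(days=3)


def folder_date(folder_name: str) -> date:
    """Parse a folder name assumed to look like YYYYMMDD, e.g. '20200427'."""
    return datetime.strptime(folder_name, "%Y%m%d").date()


def is_outlier(folder_name: str, published: datetime,
               max_drift: timedelta = MAX_DRIFT) -> bool:
    """Return True if `published` is at least `max_drift` away from the
    start (00:00 UTC) of the folder date, in either direction.

    `published` is expected to be timezone-aware.
    """
    start = datetime.combine(
        folder_date(folder_name), datetime.min.time(), tzinfo=timezone.utc
    )
    return abs(published.astimezone(timezone.utc) - start) >= max_drift
```

For example, `is_outlier("20200427", datetime(2020, 4, 30, 12, 0, tzinfo=timezone.utc))` flags a doc published 3.5 days after the start of the folder date, while a doc published on the folder date itself is not flagged.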