Natural Language Processing

2021 October 02

Eg

Elena gisly in Natural Language Processing
What a cool thing, thanks!

OA

Oluwaseun Alagbe in Natural Language Processing
I have a question. My dataset contains more than 5000 targets (column CFOP) and one feature (column Descricao Resumida). All the numbers in the CFOP column and their respective descriptions are unique. Is there any way to teach the machine to associate each description with its target?

So, for instance, if a user inputs a wrong description for the right target, the machine should be able to correct the wrong description.

Any advice please?

IR

Ilya Runov in Natural Language Processing
Could you provide more details about the task?
For example, for row 2 (1101), what does the user input?

As I understand it, this is not a classification task, because column A (CFOP) has unique values.
And as I understand it, your task is not correction (i.e., spellchecking) either.

So what "value" should the machine provide?

OA

Oluwaseun Alagbe in Natural Language Processing
Oh, okay, let me explain better.

The dataset contains up to 5000 rows and 2 columns.

The problem I want to solve is more like "corrective maintenance". For example:

2 = dogs are beautiful
3 = cats are so gentle
4 = fish can't talk

Now, if a user mistakenly inputs

3 = dogs are cute

it should be able to correct it to

2 = dogs are cute

Please note: "cute" and "beautiful" are similar ways humans express affection for something.

IR

Ilya Runov in Natural Language Processing
So your task is: take the user's input sentence, find the most similar (roughly synonymous) sentence in your existing database, and return it to the user.

In that case, the machine could create an "embedding" for every row (string) in the existing database.
When the user provides an input, the machine creates an embedding for it as well. Then it finds the most similar embedding in the db and returns the db string that corresponds to that embedding.

Embeddings can be created with the gensim (fastText) library or with Hugging Face models (e.g., LaBSE).

Finding the "most similar embedding" can be done with cosine similarity, for example. It seems the gensim library has this function.
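A minimal sketch of this lookup, assuming the sentence-transformers package and the LaBSE model mentioned above; the toy rows mirror the dogs/cats example and are purely illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

# Existing database: target -> description (toy rows from the example).
db = {
    2: "dogs are beautiful",
    3: "cats are so gentle",
    4: "fish can't talk",
}
targets = list(db.keys())
db_embeddings = model.encode(list(db.values()), convert_to_tensor=True)

def correct(description: str):
    """Return the (target, description) pair whose stored description
    is most similar to the user's input, by cosine similarity."""
    query = model.encode(description, convert_to_tensor=True)
    scores = util.cos_sim(query, db_embeddings)[0]
    best = int(scores.argmax())
    return targets[best], db[targets[best]]

print(correct("dogs are cute"))  # should point back to target 2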

OA

Oluwaseun Alagbe in Natural Language Processing
Beautiful. Thanks so much for this.
I'll check it out

OA

Oluwaseun Alagbe in Natural Language Processing
But would this be a classification problem?

Because the CFOP column contains around 5000 unique values, each with its own unique description.

IR

Ilya Runov in Natural Language Processing
1. The approach I've described is not directly related to a classification task.
2. Embeddings come with their own "constraints": the words "beautiful" and "ugly" can be more similar to each other than "dogs" and "dobermans" are. So "dogs are beautiful" could end up more similar to "dogs are ugly" than to "dobermans are beautiful".
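One quick way to check this caveat empirically, reusing the hypothetical LaBSE setup from the sketch above (actual scores depend entirely on the embedding model):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")
emb = model.encode(
    ["dogs are beautiful", "dogs are ugly", "dobermans are beautiful"],
    convert_to_tensor=True,
)
print(util.cos_sim(emb[0], emb[1]))  # same subject, opposite sentiment
print(util.cos_sim(emb[0], emb[2]))  # same sentiment, related subject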

IR

Ilya Runov in Natural Language Processing
You could also use a vector search engine. Colleagues suggested several yesterday; search the chat for the word "milvus".
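For scale (5000 rows is small, but the same pattern holds for millions), a rough sketch of the lookup backed by Milvus, assuming pymilvus 2.x and a Milvus server on localhost:19530; the collection and field names are made up:

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

# One row per description: the target id plus its LaBSE vector (768-dim).
schema = CollectionSchema([
    FieldSchema(name="target", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
])
collection = Collection("descriptions", schema)

collection.create_index("embedding", {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",  # inner product ~ cosine on normalized vectors
    "params": {"nlist": 128},
})
collection.load()

# Insert the vectors computed earlier, then search with a query vector:
# collection.insert([[2, 3, 4], db_embeddings.tolist()])
# hits = collection.search(
#     data=[query.tolist()], anns_field="embedding",
#     param={"metric_type": "IP", "params": {"nprobe": 10}}, limit=1,
# )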

DB

Debjyoti Banerjee in Natural Language Processing
While using flair for NER, I am getting a CUDA out-of-memory error. Does anyone know how to solve that?

AM

Alex Mak in Natural Language Processing
Without an exact code example it's hard to say what the problem is exactly.
I recently ran into the same thing, but with pytorch, while fine-tuning a model. Reducing the batch size helped.

DB

Debjyoti Banerjee in Natural Language Processing
@xelandar I am running flair's pre-trained NER: I load flair's SequenceTagger class and run prediction to identify person names in a list of text files that I have. After running through 7 files, it throws the CUDA out-of-memory error. I have more than 100 text files.
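A hedged sketch of one way to keep memory in check in this setup: predict in small mini-batches and release cached CUDA memory between files (the file paths and batch size are illustrative, and the tag-access style assumes a 2021-era flair):

import glob
import torch
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")  # pre-trained flair NER model

for path in glob.glob("texts/*.txt"):
    with open(path, encoding="utf-8") as f:
        sentences = [Sentence(line) for line in f if line.strip()]
    # A smaller mini_batch_size trades speed for a lower peak memory footprint.
    tagger.predict(sentences, mini_batch_size=8)
    for sentence in sentences:
        for entity in sentence.get_spans("ner"):
            if entity.tag == "PER":  # person names
                print(path, entity.text)
    del sentences
    torch.cuda.empty_cache()  # drop cached blocks between files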

OP

Oleg Polivin in Natural Language Processing
Do you run it in a Jupyter notebook?

DB

Debjyoti Banerjee in Natural Language Processing
I mean Colab

OP

Oleg Polivin in Natural Language Processing
Jupyter notebooks do not free up CUDA memory well; it would be better to rewrite your code as a .py script.

DB

Debjyoti Banerjee in Natural Language Processing
@olegpolivin I already have a Python script for that, but since I don't have a GPU on my system and it takes so much time on CPU, I am running it on Google Colab.

OA

Oluwaseun Alagbe in Natural Language Processing
Okay thanks so much

🐙

🐙 in Natural Language Processing
Hey folks, good evening! I'm having problems with spaCy. Here's the code:
nlp = Russian()
nlp.add_pipe("transformer", config=DEFAULT_CONFIG["transformer"])
nlp("Какой-то текст")
The error is in the screenshot. What I'm trying to do: build just an empty model (Russian vocab and tokenizer) with one single component, a transformer. No training needed, I just want to call it. The docs say it should work, but it doesn't. Has anyone run into this? Should I post an issue?

UPD: I won't post an issue. I updated the code to the version from master (not released yet); a lot has changed there, and now it fails on an undefined HF tokenizer. I think it will be added as a configuration parameter.
2021 October 03

🐙

🐙 in Natural Language Processing
Okay, I figured out how to build a spaCy pipeline with a transformer without training. In short: there is a pipeline, it takes a config and creates a model under the hood, and that model is stored in pipeline.model. We created the pipeline, but the model still has to be initialized: you call the under-the-hood init function that all thinc models have. Apparently, during training the initialization happens inside the training process, but when you create the pipeline by hand you have to call it yourself. The spaCy sources are a bit complex architecturally, but the code itself is simple and clear.
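A minimal sketch of this fix, assuming spaCy v3 with spacy-transformers installed; the DEFAULT_CONFIG import path matches spacy-transformers around that time and may differ in other versions:

from spacy.lang.ru import Russian
from spacy_transformers.pipeline_component import DEFAULT_CONFIG

nlp = Russian()  # blank pipeline: Russian vocab and tokenizer only
nlp.add_pipe("transformer", config=DEFAULT_CONFIG["transformer"])

# Without training, the component's thinc model is never initialized;
# nlp.initialize() is the public API that triggers that init step.
nlp.initialize()

doc = nlp("Какой-то текст")
print(doc._.trf_data)  # transformer outputs attached by the component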