Natural Language Processing

2021 October 02

Eg

Elena gisly in Natural Language Processing
What a cool thing, thanks!

OA

Oluwaseun Alagbe in Natural Language Processing
I have a question. My dataset contains more than 5000 targets (column CFOP) and one feature (column Descricao Resumida). All the numbers in the CFOP column and their respective descriptions are unique. Is there any way to teach the machine to associate each description with its target?

So, for instance, if a user inputs a wrong description for the right target, the machine should be able to correct the wrong description.

Any advice please?

IR

Ilya Runov in Natural Language Processing
Could you provide more details about the task?
For example, for row 2 (1101), what does the user input?

As I understand it, this is not a classification task, because column A (CFOP) has unique values.
And as I understand it, your task is not correction (i.e., spellchecking) either.

So what "value" should the machine provide?

OA

Oluwaseun Alagbe in Natural Language Processing
Oh, okay, let me explain better.

The dataset contains up to 5000 rows and 2 columns.

The problem I want to solve is more like "corrective maintenance". For example:

2 = dogs are beautiful
3 = cats are so gentle
4 = fish can't talk

Now, if a user mistakenly inputs

3 = dogs are cute

it should be able to correct it to

2 = dogs are cute

Please note: "cute" and "beautiful" are similar ways humans express affection for something.

IR

Ilya Runov in Natural Language Processing
So your task is: take the user's input sentence, find the most similar (roughly synonymous) sentence in your existing database, and return it to the user.

In that case, the machine could create an "embedding" for every row (string) in the existing database.
When the user provides an input, the machine creates an embedding for it as well. Then it finds the most similar embedding in the db and returns the db string that corresponds to that embedding.

Embeddings can be created with the gensim (fastText) library or with Hugging Face models (e.g., LaBSE).

Finding the "most similar embedding" can be done with cosine similarity, for example. It seems the gensim library has this function.
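A minimal sketch of this lookup, assuming the sentence-transformers package and the LaBSE model mentioned above; the toy rows mirror the dogs/cats example and are purely illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

# Existing database: target -> description (toy rows from the example).
db = {
    2: "dogs are beautiful",
    3: "cats are so gentle",
    4: "fish can't talk",
}
targets = list(db.keys())
db_embeddings = model.encode(list(db.values()), convert_to_tensor=True)

def correct(description: str):
    """Return the (target, description) pair whose stored description
    is most similar to the user's input, by cosine similarity."""
    query = model.encode(description, convert_to_tensor=True)
    scores = util.cos_sim(query, db_embeddings)[0]
    best = int(scores.argmax())
    return targets[best], db[targets[best]]

print(correct("dogs are cute"))  # should point back to target 2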

OA

Oluwaseun Alagbe in Natural Language Processing
Beautiful. Thanks so much for this.
I'll check it out

OA

Oluwaseun Alagbe in Natural Language Processing
But would this be a classification problem?

Because the CFOP column contains around 5000 unique values, each with its own unique description.

IR

Ilya Runov in Natural Language Processing
1. The approach I've described is not directly related to a classification task.
2. Embeddings come with their own "constraints": the words "beautiful" and "ugly" can be more similar to each other than "dogs" and "dobermans" are. So "dogs are beautiful" could end up more similar to "dogs are ugly" than to "dobermans are beautiful".
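One quick way to check this caveat empirically, reusing the hypothetical LaBSE setup from the sketch above (actual scores depend entirely on the embedding model):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")
emb = model.encode(
    ["dogs are beautiful", "dogs are ugly", "dobermans are beautiful"],
    convert_to_tensor=True,
)
print(util.cos_sim(emb[0], emb[1]))  # same subject, opposite sentiment
print(util.cos_sim(emb[0], emb[2]))  # same sentiment, related subject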

IR

Ilya Runov in Natural Language Processing
You could also use a vector search engine. Colleagues suggested several yesterday; search the chat for the word "milvus".
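For scale (5000 rows is small, but the same pattern holds for millions), a rough sketch of the lookup backed by Milvus, assuming pymilvus 2.x and a Milvus server on localhost:19530; the collection and field names are made up:

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="localhost", port="19530")

# One row per description: the target id plus its LaBSE vector (768-dim).
schema = CollectionSchema([
    FieldSchema(name="target", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
])
collection = Collection("descriptions", schema)

collection.create_index("embedding", {
    "index_type": "IVF_FLAT",
    "metric_type": "IP",  # inner product ~ cosine on normalized vectors
    "params": {"nlist": 128},
})
collection.load()

# Insert the vectors computed earlier, then search with a query vector:
# collection.insert([[2, 3, 4], db_embeddings.tolist()])
# hits = collection.search(
#     data=[query.tolist()], anns_field="embedding",
#     param={"metric_type": "IP", "params": {"nprobe": 10}}, limit=1,
# )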

DB

Debjyoti Banerjee in Natural Language Processing
While using flair for NER, I am getting a CUDA out-of-memory error. Does anyone know how to solve that?

AM

Alex Mak in Natural Language Processing
Without an exact code example it's hard to say what the problem is exactly.
I recently ran into the same thing, but with pytorch, while fine-tuning a model. Reducing the batch size helped.

DB

Debjyoti Banerjee in Natural Language Processing
@xelandar I am running flair's pre-trained NER: I load flair's SequenceTagger class and run prediction to identify person names in a list of text files that I have. After running through 7 files, it throws the CUDA out-of-memory error. I have more than 100 text files.
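A hedged sketch of one way to keep memory in check in this setup: predict in small mini-batches and release cached CUDA memory between files (the file paths and batch size are illustrative, and the tag-access style assumes a 2021-era flair):

import glob
import torch
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")  # pre-trained flair NER model

for path in glob.glob("texts/*.txt"):
    with open(path, encoding="utf-8") as f:
        sentences = [Sentence(line) for line in f if line.strip()]
    # A smaller mini_batch_size trades speed for a lower peak memory footprint.
    tagger.predict(sentences, mini_batch_size=8)
    for sentence in sentences:
        for entity in sentence.get_spans("ner"):
            if entity.tag == "PER":  # person names
                print(path, entity.text)
    del sentences
    torch.cuda.empty_cache()  # drop cached blocks between files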

OP

Oleg Polivin in Natural Language Processing
Do you run it in a Jupyter notebook?

DB

Debjyoti Banerjee in Natural Language Processing
I mean Colab

OP

Oleg Polivin in Natural Language Processing
Jupyter notebooks do not free up CUDA memory well; it would be better to rewrite your code as a .py script.

DB

Debjyoti Banerjee in Natural Language Processing
@olegpolivin I already have a Python script for that, but since I don't have a GPU on my system and it takes so much time on CPU, I am running it on Google Colab.

OA

Oluwaseun Alagbe in Natural Language Processing
Okay thanks so much

🐙

🐙 in Natural Language Processing
Hey folks, good evening! I'm having problems with spaCy. Here's the code:
nlp = Russian()
nlp.add_pipe("transformer", config=DEFAULT_CONFIG["transformer"])
nlp("Какой-то текст")
The error is in the screenshot. What I'm trying to do: build just an empty model (Russian vocab and tokenizer) with one single component, a transformer. No training needed, I just want to call it. The docs say it should work, but it doesn't. Has anyone run into this? Should I post an issue?

UPD: I won't post an issue. I updated the code to the version from master (not released yet); a lot has changed there, and now it fails on an undefined HF tokenizer. I think it will be added as a configuration parameter.
2021 October 03

🐙

🐙 in Natural Language Processing
Okay, I figured out how to build a spaCy pipeline with a transformer without training. In short: there is a pipeline, it takes a config and creates a model under the hood, and that model is stored in pipeline.model. We created the pipeline, but the model still has to be initialized: you call the under-the-hood init function that all thinc models have. Apparently, during training the initialization happens inside the training process, but when you create the pipeline by hand you have to call it yourself. The spaCy sources are a bit complex architecturally, but the code itself is simple and clear.
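A minimal sketch of this fix, assuming spaCy v3 with spacy-transformers installed; the DEFAULT_CONFIG import path matches spacy-transformers around that time and may differ in other versions:

from spacy.lang.ru import Russian
from spacy_transformers.pipeline_component import DEFAULT_CONFIG

nlp = Russian()  # blank pipeline: Russian vocab and tokenizer only
nlp.add_pipe("transformer", config=DEFAULT_CONFIG["transformer"])

# Without training, the component's thinc model is never initialized;
# nlp.initialize() is the public API that triggers that init step.
nlp.initialize()

doc = nlp("Какой-то текст")
print(doc._.trf_data)  # transformer outputs attached by the component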