Telegram chat of the rlang group

R language and Statistical data analysis

997 members

2020 March 02

Vladimir Volokhonsky in R language and Statistical data analysis

Experts recommend using tabulate instead of table for integer-valued data.
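
A minimal sketch of the difference, using a made-up vector v for illustration: tabulate returns a plain integer vector with one bin for each value from 1 to max(v), while table builds a named table of the observed values only.

```r
# Hypothetical integer codes in 1..5
v <- c(3L, 1L, 3L, 5L, 3L, 1L)

counts_tab <- tabulate(v)   # unnamed integer vector, bins for 1..max(v)
counts_tbl <- table(v)      # named table, observed values only

print(counts_tab)           # 2 0 3 0 1
print(counts_tbl)

# Both agree on the values that actually occur
same <- all(counts_tab[as.integer(names(counts_tbl))] == as.vector(counts_tbl))
print(same)                 # TRUE
```

The unobserved values 2 and 4 show up as zero bins in the tabulate result, which is why it only makes sense for reasonably small positive integers.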

16:08 #1

Philipp Upravitelev in R language and Statistical data analysis

But it works with ints, and I have strings

16:09 #2

Григорий Демин in R language and Statistical data analysis

Артём Клевцов

which.max returns the first one and does not guarantee a stable result.

Yes, that does happen. But it depends a lot on what you do with the result afterwards. which(res %in% max(res)) returns the codes of the colors that occur most often.
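
A small sketch of the tie case discussed here, with a made-up count vector res in which codes 2 and 4 are equally frequent:

```r
# Hypothetical counts per code; codes 2 and 4 tie for the maximum
res <- c(2L, 5L, 1L, 5L)

first_max <- which.max(res)         # only the first maximum
all_max <- which(res == max(res))   # every code that reaches the maximum

print(first_max)   # 2
print(all_max)     # 2 4
```

Which variant is appropriate depends on whether the downstream code expects a single winner or all tied codes.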

16:10 #3

Philipp Upravitelev in R language and Statistical data analysis

Григорий Демин

It's not really working out for me

> matr <- x
> splitter = matr[,3]*100*100 + matr[,2]*100 + matr[,1]
> res = tabulate(splitter)
> which(splitter == which.max(res))
[1] 1 3 5
> f_base(x)
     [,1] [,2] [,3]
[1,]    1    2    1

16:14 #4

Григорий Демин in R language and Statistical data analysis

which(splitter == which.max(res)) is incorrect, as I understand it.

16:15 #5

Григорий Демин in R language and Statistical data analysis

f_dt and f_base give me different results

16:16 #6

Philipp Upravitelev in R language and Statistical data analysis

Григорий Демин

f_dt and f_base give me different results

Strange

> f_base(x)
     [,1] [,2] [,3]
[1,]    1    2    1
> f_dt(x)
     [,1] [,2] [,3]
[1,] 1    2    1

16:24 #7

Philipp Upravitelev in R language and Statistical data analysis

Ah, in the dt version you also need to unlist

16:25 #8

Григорий Демин in R language and Statistical data analysis

I changed the matrix:

set.seed(123)
N = 10000
x <- matrix(nrow = N, ncol = 3, data = sample(0:31, N*3, replace = TRUE))

16:25 #9

Philipp Upravitelev in R language and Statistical data analysis

Hmm

16:26 #10

Григорий Демин in R language and Statistical data analysis

In f_base, the last mx inside paste is missing a comma
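
The faulty version of f_base is not shown here, so the following is only a sketch of what a missing comma does, assuming the last argument was written as mx[3] instead of mx[, 3]. The matrix m is made up for illustration.

```r
# Hypothetical 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2)

single <- m[3]     # third element in column-major order
column <- m[, 3]   # the whole third column

print(single)      # 3
print(column)      # 5 6

# With the comma missing, paste recycles one value across all rows
print(paste(m[, 1], m[, 2], m[3]))    # "1 3 3" "2 4 3"
print(paste(m[, 1], m[, 2], m[, 3]))  # "1 3 5" "2 4 6"
```

Because paste recycles the single value silently, the function still runs but builds the wrong keys.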

16:41 #11

Григорий Демин in R language and Statistical data analysis

set.seed(123)
N = 10000
x <- matrix(nrow = N, ncol = 3, data = sample(0:31, N*3, replace = TRUE))
mx <- x
library(data.table)
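f_dt <- function(mx) {
    # Convert the matrix to a data.table (columns V1, V2, V3)
    mx <- as.data.frame(mx)
    setDT(mx)
    # Count rows per unique (V1, V2, V3) triple and keep the most frequent one
    mx <- mx[, .N, by = list(V1, V2, V3)][order(-N)][1]
    # Return the winning triple as a 1 x 3 matrix
    mx[, cbind(V1, V2, V3)]
}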
f_dt(x)


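f_base <- function(mx) {
    # Collapse each row into a single space-separated string key
    mx <- paste(mx[, 1], mx[, 2], mx[, 3])
    # Count the keys and take the most frequent one
    mx <- sort(table(mx), decreasing = TRUE)
    mx <- names(mx[1])
    # Split the key back into its three numbers
    mx <- as.numeric(strsplit(mx, '\\s')[[1]])
    matrix(nrow = 1, ncol = 3, data = mx)
}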
f_base(x)

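f_tabulate = function(mx){
    # Encode each row as one integer in base `mult` (all values must be < mult)
    mult = 100
    splitter = mx[,3]*mult*mult + mx[,2]*mult + mx[,1]
    # Count occurrences of every code; tabulate ignores non-positive codes
    res = tabulate(splitter)
    # Take the first code with the maximal count
    res = which(res == max(res))[1]
    # Return the first row carrying that code as a 1 x 3 matrix
    mx[splitter==res, ,drop = FALSE][1, ,drop = FALSE]
}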
f_tabulate(x)

library(microbenchmark)
microbenchmark(
    dt = f_dt(x),
    base = f_base(x),
    tabulate = f_tabulate(x)
)
# Unit: milliseconds
# expr       min        lq      mean    median        uq       max neval
# dt       3.108099  3.325447  3.886822  3.967064  4.221989  9.307412   100
# base    57.220403 59.190113 60.389002 60.283311 61.184157 66.289113   100
# tabulate 1.217681  1.334715  2.162456  1.357393  1.439168 60.760551   100

16:42 #12

Артём Клевцов in R language and Statistical data analysis

A naive solution with Rcpp:

// [[Rcpp::plugins(cpp11)]]

#include <Rcpp.h>
#include <string>
#include <unordered_map>
#include <vector>

using namespace Rcpp;

template <typename T>
class hasher {
public:
  std::size_t operator()(const T& vec) const {
    size_t seed = vec.size();
    for(auto& i : vec) {
      seed ^= i + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};

// [[Rcpp::export]]
List count_rows(IntegerMatrix x) {
  size_t nrows = x.rows();
  hasher<IntegerMatrix::Row> hash_fn;
  std::vector<std::string> hashes(nrows);
  std::unordered_map<std::string,int> cnt;
  for (size_t i = 0; i < nrows; ++i) {
    IntegerMatrix::Row ri = x.row(i);
    std::string h = std::to_string(hash_fn(ri));
    hashes[i] = h;
    cnt[h]++;
  }
  List res = List::create(
    Named("hash") = hashes,
    Named("counts") = cnt
  );
  return res;
}

/*** R
set.seed(123)
N = 10000
x <- matrix(nrow = N, ncol = 3, data = sample(0:31, N*3, replace = TRUE))
res <- count_rows(x)
h  <- names(res$counts)[max(res$counts) == res$counts]
match(h, res$hash)
*/

16:44 #13

Артём Клевцов in R language and Statistical data analysis

We hash the rows and then count how often each hash repeats.

16:44 #14

Артём Клевцов in R language and Statistical data analysis

res$counts holds the frequency of each hash, and res$hash holds the hash of every row.
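
A sketch of how such a result can be read back in R. The list res below is a hand-made stand-in with the same shape as the output of count_rows, and its hash strings are invented rather than real hash values.

```r
# Hypothetical stand-in for count_rows() output: one hash per row plus hash counts
res <- list(
  hash = c("a1", "b2", "a1", "c3", "a1"),
  counts = c(a1 = 3L, b2 = 1L, c3 = 1L)
)

# Hash (or hashes, on ties) with the highest frequency
top <- names(res$counts)[res$counts == max(res$counts)]

first_row <- match(top, res$hash)      # index of the first row with that hash
all_rows <- which(res$hash %in% top)   # indices of all rows with that hash

print(first_row)   # 1
print(all_rows)    # 1 3 5
```

Since different rows can in principle produce the same hash, a careful version would also compare the matched rows themselves before trusting the counts.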

16:45 #15

Артём Клевцов in R language and Statistical data analysis

The hash function was stolen from here: https://stackoverflow.com/a/27216842/1863950 (Stack Overflow, "A good hash function for a vector")

16:46 #16

Григорий Демин in R language and Statistical data analysis

Hmm, if you use a multiplier of 32 instead of 100 in f_tabulate, it becomes almost 10 times faster than data.table:

f_tabulate = function(mx){
    mult = 32
    splitter = mx[,3]*mult*mult + mx[,2]*mult + mx[,1]
    res = tabulate(splitter)
    res = which(res == max(res))[1]
    mx[splitter==res, ,drop = FALSE][1, ,drop = FALSE]
}
# Unit: microseconds
# expr       min        lq       mean     median         uq       max neval
# dt            3098.166  3185.073  3728.0666  3334.8835  4014.0755 12412.863   100
# base     57688.870 58858.048 60201.7815 59508.4385 60817.6605 70619.858   100
# tabulate   314.518   352.260   409.4361   359.3785   364.3445  2914.091   100
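
One likely reason for the speedup is that a smaller multiplier shrinks the range of codes tabulate has to cover. A rough sketch for values in 0..31, with helper names max_code, encode and decode introduced here purely for illustration:

```r
# Largest code that can occur for values in 0..top under a given multiplier
max_code <- function(mult, top = 31) top * mult * mult + top * mult + top

print(max_code(100))  # 313131
print(max_code(32))   # 32767

# The encoding stays reversible as long as mult exceeds the largest value
encode <- function(row, mult) row[3] * mult * mult + row[2] * mult + row[1]
decode <- function(code, mult) {
  c(code %% mult, (code %/% mult) %% mult, code %/% (mult * mult))
}

code <- encode(c(7, 0, 31), 32)
print(code)               # 31751
print(decode(code, 32))   # 7 0 31
```

With mult = 32 the count vector is roughly ten times shorter than with mult = 100, at the cost of requiring every value to be below 32.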

16:55 #17

Артём Клевцов in R language and Statistical data analysis

Григорий Демин

Hmm, if you use a multiplier of 32 instead of 100 in f_tabulate, it becomes almost 10 times faster than data.table: [...]

Hmm.

> length(splitter)
[1] 10000
> length(tabulate(splitter))
[1] 32759

17:25 #18

Григорий Демин in R language and Statistical data analysis

Артём Клевцов

Hmm.

> length(splitter)
[1] 10000
> length(tabulate(splitter))
[1] 32759

Yes, it creates a separate bin for every possible value and counts how many times that value occurs

17:26 #19

Григорий Демин in R language and Statistical data analysis

So the result is a vector of length max(arg)
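
A short sketch of that behaviour on a made-up vector codes. It also shows the nbins argument of tabulate and the fact that values outside 1..nbins, such as 0, are dropped without a warning.

```r
# Hypothetical codes; the largest value decides the default number of bins
codes <- c(4L, 9L, 4L)

len_default <- length(tabulate(codes))             # one bin per value 1..max(codes)
len_fixed <- length(tabulate(codes, nbins = 12L))  # bin count set explicitly
kept <- sum(tabulate(c(0L, 2L, 2L)))               # the 0 is silently ignored

print(c(len_default, len_fixed, kept))   # 9 12 2
```

For the row encoding used in f_tabulate this means that an all-zero row gets code 0 and is never counted.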

17:27 #20