Channel: Data Science, Analytics and Big Data discussions - Latest topics

Correlation matrix on a very large dataset

@kthouz wrote:

Hi,

I need help or ideas on how to generate a correlation matrix for a large dataset. I am dealing with a dataset of 1,000,000 customers (rows) and 50 items (columns). Cell (i, j) is 1 if customer i has bought item j in the past. I want to find out how similar customers are to each other by calculating the correlation between customers.

A naive approach is two nested loops over the n(n-1)/2 customer pairs (I also tried pandas.DataFrame.corr). When I do this, my PC freezes. I am using Python on a Mac (8 GB RAM, 3.24 GHz). I also tried Spark (Scala), and it ran out of memory as well. I was thinking of MapReduce, but a friend told me it won't help with this kind of pairwise computation.
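For scale: with n = 1,000,000 customers there are about 5 × 10^11 pairs, and a dense 1,000,000 × 1,000,000 matrix of 64-bit floats would take roughly 8 TB, so the full matrix cannot fit in 8 GB of RAM whichever tool computes it. One possible workaround, sketched below, is to process customers in blocks and keep only the k most correlated customers for each row instead of storing every pair. This is only a sketch under assumptions that are not part of my setup above: the data fits in memory as a NumPy array, cells are 0/1, and the names `standardize_rows`, `top_k_correlated`, `k` and `block_size` are made up for illustration.

```python
import numpy as np


def standardize_rows(X):
    """Center each row and scale it to unit length.

    After this step, the dot product of two rows equals their Pearson
    correlation. Rows with zero variance (a customer who bought nothing,
    or everything) have an undefined correlation; here they are mapped to
    all-zero rows, so they score 0 against everyone (an assumption).
    """
    X = np.asarray(X, dtype=np.float32)
    centered = X - X.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    safe_norms = np.where(norms > 0, norms, 1.0)
    return centered / safe_norms


def top_k_correlated(X, k=5, block_size=256):
    """Return (indices, values) of the k most correlated rows for every row.

    Only one (block_size x n) slab of the correlation matrix exists at a
    time, so peak memory is about block_size * n * 4 bytes instead of n * n.
    Requires at least 2 rows and k >= 1.
    """
    Z = standardize_rows(X)
    n = Z.shape[0]
    k = min(k, n - 1)
    idx = np.empty((n, k), dtype=np.int64)
    val = np.empty((n, k), dtype=np.float32)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        corr = Z[start:stop] @ Z.T  # correlations of this block vs. all rows
        # Exclude each customer's correlation with itself.
        corr[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Pick the k largest entries per row without sorting the whole row.
        part = np.argpartition(-corr, k - 1, axis=1)[:, :k]
        part_vals = np.take_along_axis(corr, part, axis=1)
        # Sort just those k entries in descending order.
        order = np.argsort(-part_vals, axis=1)
        idx[start:stop] = np.take_along_axis(part, order, axis=1)
        val[start:stop] = np.take_along_axis(part_vals, order, axis=1)
    return idx, val


# Small synthetic example (random 0/1 purchases, not real data).
rng = np.random.default_rng(0)
demo = rng.integers(0, 2, size=(1000, 50))
neighbors, scores = top_k_correlated(demo, k=3, block_size=128)
print(neighbors[:2])
print(scores[:2])
```

With the default `block_size=256` and n = 1,000,000, each slab would be about 1 GB in float32. The total work is still proportional to n², so a full run may take a long time on a laptop. Since there are only 50 binary columns, many customers may share exactly the same purchase pattern; identical rows have identical correlations, so deduplicating rows first (for example with `np.unique(X, axis=0)`) might shrink the problem considerably.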

Any ideas, please?

Posts: 3

Participants: 3
