I began to fall in love with a Python visualization library called Altair, and I use it in every small data science project I do.
Now, in terms of industry use cases, does it make sense to visualize big data, or should we just take a random sample?
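For the random-sample option, here is a minimal sketch, assuming the data already sits in a pandas DataFrame. The column names, the one million rows, and the sample size are made up for illustration; 5,000 is chosen because it matches Altair's default row limit.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a large dataset: one million rows, two columns.
rng = np.random.default_rng(0)
big = pd.DataFrame({"x": rng.normal(size=1_000_000),
                    "y": rng.normal(size=1_000_000)})

# Draw a reproducible random sample that is small enough to plot comfortably.
sample = big.sample(n=5_000, random_state=0)
```

The sampled frame can then be passed to the chart in place of the full one. Whether a sample is acceptable depends on whether rare points matter for the plot.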
I have three dataframes. Their shapes are (2656, 246), (2656, 2412) and (2656, 7025). I want to merge them column-wise, as shown in the image, so that the result is a (2656, 9683) dataframe. Thanks for any help.
Typo in the image: Dataframe 3 should have 7025 columns, not 5668.
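One way to get that shape, sketched under the assumption that the three frames share the same row index and have distinct column names, is `pd.concat` with `axis=1`. The frames below are zero-filled placeholders with the shapes from the question. The helper `make_frame` and the column prefixes are hypothetical.

```python
import numpy as np
import pandas as pd

def make_frame(n_cols, prefix):
    """Build a placeholder frame with 2656 rows and uniquely named columns."""
    data = np.zeros((2656, n_cols), dtype=np.int8)
    return pd.DataFrame(data, columns=[f"{prefix}{i}" for i in range(n_cols)])

df1 = make_frame(246, "a")
df2 = make_frame(2412, "b")
df3 = make_frame(7025, "c")

# axis=1 places the frames side by side; rows are aligned on the index.
merged = pd.concat([df1, df2, df3], axis=1)
print(merged.shape)  # (2656, 9683)
```

If the real frames have different indexes, calling `reset_index(drop=True)` on each one first keeps the rows lined up by position.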
I am currently trying to open a file with pandas and Python for machine learning purposes; it would be ideal for me to have all of the data in a DataFrame. The file is 18 GB and my RAM is 32 GB, but I keep getting memory errors.
From your experience, is it possible? If not, do you know of a better way to go about this (a Hive table, increasing my RAM to 64 GB, or creating a database and accessing it from Python)?
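Before changing hardware, one thing that may be worth trying is reading the file in chunks and shrinking the dtypes as you go. The sketch below assumes the file is a CSV, which the question does not state. The helper `read_csv_in_chunks` is hypothetical, and a three-row in-memory CSV stands in for the real file.

```python
import io
import pandas as pd

def read_csv_in_chunks(source, chunksize):
    """Read a CSV piece by piece, storing float columns as float32."""
    pieces = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        float_cols = chunk.select_dtypes(include="float64").columns
        pieces.append(chunk.astype({col: "float32" for col in float_cols}))
    return pd.concat(pieces, ignore_index=True)

# Tiny in-memory CSV standing in for the real 18 GB file.
sample_csv = io.StringIO("a,b\n1.5,2\n2.5,3\n3.5,4\n")
df = read_csv_in_chunks(sample_csv, chunksize=2)
```

With a real file, `source` would be the path and `chunksize` something like a few hundred thousand rows. Passing `usecols` or `dtype` to `pd.read_csv` can cut memory further if only some columns are needed.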
What are the benefits of using Hadoop, HBase, or Hive? From my understanding, HBase avoids using MapReduce and provides column-oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase. I would also like to know how Hive compares with Pig.
I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, every way I have tried takes days.
I tried putting them in a table and deleting in batches of 100. Four days later, this is still running with only 297,268 rows deleted. (For each batch I had to select 100 IDs from the IDs table, delete the rows whose ID is IN that list, and then delete the 100 selected IDs from the IDs table.)
I tried:
DELETE FROM tbl WHERE id IN (select * from ids)
That's taking forever, too. It is hard to gauge how long, since I can't see its progress until it is done, but the query was still running after 2 days.
I am just looking for the most effective way to delete from a table when I know the specific IDs to delete, and there are millions of IDs.
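The batching approach described above can be sketched with much larger batches, each committed as its own transaction, so progress stays visible. This is an illustration, not a diagnosis: it uses Python's built-in `sqlite3` as a stand-in so that it runs anywhere, and the `payload` column and the helper `delete_listed_rows` are hypothetical. Against PostgreSQL the same statements would go through a driver such as psycopg2, with `%s` placeholders instead of `?`.

```python
import sqlite3

def delete_listed_rows(conn, batch_size=10_000):
    """Delete rows of tbl whose id is listed in ids, one committed batch at a time."""
    deleted = 0
    batch = "SELECT id FROM ids ORDER BY id LIMIT ?"
    while conn.execute("SELECT 1 FROM ids LIMIT 1").fetchone() is not None:
        with conn:  # each batch commits as its own transaction
            cur = conn.execute(f"DELETE FROM tbl WHERE id IN ({batch})", (batch_size,))
            deleted += cur.rowcount
            # Remove the same batch of ids so the loop eventually terminates.
            conn.execute(f"DELETE FROM ids WHERE id IN ({batch})", (batch_size,))
    return deleted

# Small in-memory demo: 1000 rows, of which the 500 even ids are to be deleted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE ids (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO tbl VALUES (?, 'x')", [(i,) for i in range(1, 1001)])
conn.executemany("INSERT INTO ids VALUES (?)", [(i,) for i in range(2, 1001, 2)])
conn.commit()

deleted = delete_listed_rows(conn, batch_size=128)
remaining = conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]
```

If each batch is still slow on the real database, a missing index on `tbl.id`, or on foreign keys that reference it, is a common suspect, though that is a guess without seeing the schema.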
I have a large set of data (about 8 GB). I would like to use machine learning to analyze it. So, I think that I should use SVD and then PCA to reduce the data dimensionality for efficiency. However, MATLAB and Octave cannot load such a large dataset.
What tools can I use to do SVD with such a large amount of data?
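One possible direction, sketched here with NumPy, is to never load the whole matrix at all. If the number of columns d is modest (an assumption, since the question gives only the total size), PCA needs just the column sums and the d-by-d matrix of cross products, and both can be accumulated one chunk of rows at a time. The function `streaming_pca` and the synthetic data are hypothetical.

```python
import numpy as np

def streaming_pca(chunks, n_components):
    """PCA from row chunks, keeping only d-by-d running totals in memory."""
    n_rows, col_sum, gram = 0, None, None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        if gram is None:
            d = chunk.shape[1]
            col_sum, gram = np.zeros(d), np.zeros((d, d))
        n_rows += chunk.shape[0]
        col_sum += chunk.sum(axis=0)
        gram += chunk.T @ chunk
    mean = col_sum / n_rows
    # Sample covariance recovered from the accumulated totals.
    cov = (gram - n_rows * np.outer(mean, mean)) / (n_rows - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, top].T, eigvals[top]

# Synthetic data standing in for the real dataset, fed in chunks of 100 rows.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
chunks = (data[i:i + 100] for i in range(0, 1000, 100))
mean, components, variances = streaming_pca(chunks, n_components=2)
```

Projecting a chunk onto the result is `(chunk - mean) @ components.T`. Subtracting `n_rows * outer(mean, mean)` can lose precision when the column means are very large relative to the spread, so this is a starting point and not a robust implementation.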
I have recently started looking into querying large sets of CSV data stored on HDFS using Hive and Impala. As I was expecting, I get better response times with Impala than with Hive for the queries I have used so far.
I am wondering if there are some types of queries/use cases that still need Hive and where Impala is not a good fit.
How does Impala provide faster query response compared to Hive for the same data on HDFS?