I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.
One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard-drive.
My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this:
What are some best-practice workflows for accomplishing the following:
Loading flat files into a permanent, on-disk database structure
Querying that database to retrieve data to feed into a pandas data structure
Updating the database after manipulating pieces in pandas
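For context, a minimal sketch of the kind of HDFStore workflow this describes, assuming PyTables is installed; the file name, store key, and column names are made up purely for illustration:

import pandas as pd

# Hypothetical file and column names, just to illustrate the pattern.
with pd.HDFStore("data.h5") as store:
    # 1. Load a flat file into an on-disk store in chunks,
    #    so the whole file never has to fit in memory.
    for chunk in pd.read_csv("big_file.csv", chunksize=500_000):
        # data_columns makes these fields queryable with `where`.
        store.append("records", chunk, data_columns=["year", "state"])

    # 2. Query the store to pull only the rows you need into a DataFrame.
    df = store.select("records", where="year == 2012 & state == 'OH'")

    # 3. Manipulate in pandas, then write the result back under a new key.
    df["flag"] = df["amount"] > 1000
    store.put("records_subset", df, format="table")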
I am not a database expert and have no formal computer science background, so bear with me. I want to know the kinds of real-world negative things that can happen if you use an old MongoDB version prior to v4, which was not ACID compliant. This applies to any ACID-noncompliant database.
I understand that MongoDB can perform atomic operations, but that it doesn't "support traditional locking and complex transactions", mostly for performance reasons. I also understand the importance of database transactions: for example, when your database is for a bank and you're updating several records that all need to be in sync, you want the transaction to revert to its initial state if there's a power outage, so that credits equal purchases, etc.
But when I get into conversations about MongoDB, those of us that don't know the technical details of how databases are actually implemented start throwing around statements like:
MongoDB is way faster than MySQL and Postgres, but there's a tiny chance, like...
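To make the bank example concrete, here is a minimal sketch of what a multi-document transaction looks like once it is available (MongoDB 4.0+ running as a replica set), written in Python with pymongo; the database, collection, and field names are placeholders:

from pymongo import MongoClient

# Multi-document transactions require MongoDB 4.0+ on a replica set.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
accounts = client.bank.accounts

with client.start_session() as session:
    # Everything inside the transaction either all commits or all rolls back,
    # so a crash midway can't leave the credit without the matching debit.
    with session.start_transaction():
        accounts.update_one({"_id": "alice"}, {"$inc": {"balance": -100}}, session=session)
        accounts.update_one({"_id": "bob"}, {"$inc": {"balance": 100}}, session=session)

On versions prior to 4.0 (the situation the question asks about), only single-document updates are atomic, so the two updates above could be interrupted between operations and leave the balances inconsistent.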
I'm using Simba's ODBC driver to connect Tableau to MongoDB. I connected my workbook to MongoDB and developed my dashboard, and everything went pretty smoothly. However, after I publish my workbook to Tableau Server and try to open it, I get an error message:
"An unexpected error occurred. If you continue to receive this error please contact your Tableau Server Administrator. Tableau Server encountered an error. Contact your Tableau Server Administrator. Would you like to reset the view?"
Why can't I view this dashboard in Tableau Server?
I want to export all collections in MongoDB by the command:
mongoexport -d dbname -o Mongo.json
The result is: No collection specified!
The manual says that if you don't specify a collection, all collections will be exported. However, why doesn't this work?
http://docs.mongodb.org/manual/reference/mongoexport/#cmdoption-mongoexport--collection
My MongoDB version is 2.0.6.
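Since mongoexport handles one collection per invocation, a common workaround is to list the collections yourself and export each in turn. A rough sketch in Python (the database name and output paths are placeholders, and the collection-listing method name varies across pymongo versions):

import subprocess
from pymongo import MongoClient

# List every collection in the database, then run mongoexport once per
# collection, since mongoexport only exports a single collection at a time.
db = MongoClient("mongodb://localhost:27017")["dbname"]
for name in db.list_collection_names():
    subprocess.run(
        ["mongoexport", "-d", "dbname", "-c", name, "-o", f"{name}.json"],
        check=True,
    )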
I need to implement a big data storage + processing system.
The data grows daily (at most about 50 million rows per day), and each row is a very simple JSON document with about 10 fields (dates, numbers, text, ids).
The data should then be queryable online (if possible), with arbitrary groupings on some of the fields of the document (date-range queries, ids, etc.).
I'm thinking of using a MongoDB cluster to store all this data, building indexes on the fields I need to query, and then processing the data in an Apache Spark cluster (mostly simple aggregations + sorting). Maybe use spark-jobserver to build a REST API around it.
I have concerns about MongoDB's scalability (i.e. storing 10B+ rows), its throughput (quickly sending 1B+ rows to Spark for processing), and its ability to maintain indexes on such a large database.
As an alternative, I'm considering Cassandra or HBase, which I believe are more suitable for storing large datasets, but offer less performance in querying, which I'd...
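For the Spark side of the plan, the aggregation itself is simple once the documents are in a DataFrame, whether they arrive through the MongoDB Spark connector or from exported JSON files; the sketch below reads JSON from disk to stay self-contained, and the path and field names are placeholders matching the ~10-field document described above:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

# The DataFrame could instead come from the MongoDB Spark connector;
# reading exported JSON is shown here only to keep the example runnable.
docs = spark.read.json("s3://bucket/events/*.json")  # placeholder path

# Arbitrary grouping over a date range, e.g. counts and sums per id per day.
result = (
    docs.filter(F.col("date").between("2015-01-01", "2015-01-31"))
        .groupBy("id", "date")
        .agg(F.count("*").alias("rows"), F.sum("number").alias("total"))
        .orderBy(F.desc("total"))
)
result.show()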