It seems like R is really designed to handle datasets that it can pull entirely into memory. What R packages are recommended for signal processing and machine learning on very large datasets that cannot be pulled into memory?
If R is simply the wrong way to do this, I am open to other robust free suggestions (e.g., scipy, if there is some nice way to handle very large datasets).
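For reference, since scipy is mentioned as an acceptable route: below is a minimal sketch of out-of-core signal processing, streaming a larger-than-RAM signal through a scipy filter in chunks via numpy.memmap. The file names, dtype, and filter design are illustrative assumptions, not a definitive pipeline.

# Sketch: stream a signal too large for RAM through scipy chunk by chunk.
# File names, dtype, and the filter itself are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfilt, sosfilt_zi

sig = np.memmap("huge_signal.f32", dtype=np.float32, mode="r")  # on-disk array
sos = butter(4, 0.1, output="sos")       # 4th-order low-pass, normalized freq
zi = sosfilt_zi(sos) * sig[0]            # carry filter state across chunks

chunk = 1_000_000
out = np.memmap("filtered.f32", dtype=np.float32, mode="w+", shape=sig.shape)
for start in range(0, len(sig), chunk):
    block = sig[start:start + chunk]
    y, zi = sosfilt(sos, block, zi=zi)   # stateful filtering per chunk
    out[start:start + chunk] = y
out.flush()

Only one chunk is ever resident in memory, so the same pattern scales to arbitrarily large files.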
I never got a chance to work on Impala and have just started reading about it, but I have one basic question I am not clear about: Impala has its own daemons, so does it also have its own execution engine, or does it run on MapReduce or another execution engine? Thanks in advance.
From the official Hive documentation: "Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries."
I'm not an expert on database architecture, and I would like to know whether there is an alternative when the assumption above does not hold, that is, when queries are made over a big data set.
I am trying to understand what would be the best big data solution for reporting purposes.
Currently I have narrowed it down to HBase vs. Hive.
The use case is that we have hundreds of terabytes of data across hundreds of different files. The data is live and gets updated all the time. We need to provide the most efficient way to do reporting. We have dozens of different report pages, where each report consists of different types of numeric and graph data. For instance (the first of these is sketched as a query after the list):
Show all users who logged in to the system in the last hour and whose origin is the US.
Show a graph of games from most played to least played.
From all users in the system, show the percentage of paying vs. non-paying users.
For a given user, show their entire history: how many games they played, what kinds of games, and what their score was in each game.
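For illustration, a hedged sketch of how the first report above might be issued against Hive from Python using PyHive; the logins table and its user_id, country, and login_time columns are hypothetical names, not part of any actual schema here.

# Sketch: report 1 ("users logged in within the last hour from the US")
# as a HiveQL query via PyHive. Table and column names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive-server", port=10000)  # HiveServer2 placeholder
cur = conn.cursor()
cur.execute("""
    SELECT user_id
    FROM logins
    WHERE country = 'US'
      AND login_time >= from_unixtime(unix_timestamp() - 3600)
""")
for (user_id,) in cur.fetchall():
    print(user_id)

Whether a full-scan query like this is fast enough on live data is exactly the Hive-vs-HBase trade-off the question is about.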
The way I see it, there are 3 solutions:
Store all data in Hadoop and do the queries in Hive. This might work, but I am not sure about the performance. How will it perform when the...
I have recently deployed a big data cluster, in which I used Apache Kafka and ZooKeeper, but I still don't understand their usage in the cluster. When are both required, and for what purpose?
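For orientation, a minimal sketch of the two roles using the kafka-python client: applications publish to and read from the Kafka brokers, while ZooKeeper coordinates the brokers themselves (controller election, cluster metadata) in classic Kafka deployments. The broker address and topic name below are placeholders.

# Sketch: a minimal Kafka producer/consumer pair using kafka-python.
# Broker address and topic name are placeholders.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"user 42 logged in")  # publish to the 'events' topic
producer.flush()

consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=10000)
for msg in consumer:
    # Clients talk only to brokers; ZooKeeper works behind the scenes.
    print(msg.value)
    break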
I'm using google-bigquery on the Chicago crime dataset. I want to find the most frequent crime type from the primary_type column for each distinct block. To do so, I came up with the following standard SQL. Data: since the Chicago crime data is rather big, there is an official website where you can preview the dataset: crime data on Google Cloud. My current standard SQL:
SELECT primary_type, block, COUNT(*) AS count
FROM `bigquery-public-data.chicago_crime.crime`
HAVING COUNT(*) = (
    SELECT MAX(count)
    FROM (
        SELECT primary_type, COUNT(*) AS count
        FROM `bigquery-public-data.chicago_crime.crime`
        GROUP BY primary_type, block
    ) `bigquery-public-data.chicago_crime.crime`
)
The problem with my query above is that it currently has an error, and to me the query is quite inefficient even if I fix the error. How can I fix and optimize it? How to work with regex in standard SQL: to count the most frequent type for each block, including both North and South, I have to deal with regex; for example, for 033XX S WOOD ST, I should...
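For what it's worth, a hedged sketch of one standard way to express "top crime type per block" without the self-join, using a ROW_NUMBER() window function and the google-cloud-bigquery Python client. It assumes credentials are already configured and is a sketch, not a verified fix.

# Sketch: most frequent primary_type per block via ROW_NUMBER().
# Assumes google-cloud-bigquery is installed and credentials are set up.
from google.cloud import bigquery

sql = """
SELECT block, primary_type, cnt
FROM (
  SELECT block, primary_type, COUNT(*) AS cnt,
         ROW_NUMBER() OVER (PARTITION BY block
                            ORDER BY COUNT(*) DESC) AS rn
  FROM `bigquery-public-data.chicago_crime.crime`
  GROUP BY block, primary_type
)
WHERE rn = 1
"""

client = bigquery.Client()
for row in client.query(sql).result():
    print(row.block, row.primary_type, row.cnt)

The table is scanned once and ranked per block, avoiding the second full aggregation in the original query.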
I have a question about how to filter relevant records from a large data set of financial transactions. We use an Oracle 11g database, and one of the requirements is to produce various end-of-day reports with all sorts of criteria.
The relevant tables look roughly like this:
trade_metadata 18m rows, 10 GB
trade_economics 18m rows, 15 GB
business_event 18m rows, 11 GB
trade_business_event_link 18m rows, 3 GB
One of our reports is now taking ages to run (> 5 hours). The underlying proc has been optimized time and again, but new criteria keep getting added, so we start struggling again. The proc is pretty standard: join all the tables and apply a host of WHERE clauses (20 at the last count).
I was wondering if I have a problem large enough to consider big data solutions, to get rid of this optimize-the-query game every few months. In any case, the volumes are only going up. I have read up a bit about Hadoop + HBase, Cassandra, Apache Pig, etc., but being very new to this...
I can't wrap my head around the basic theoretical concept of 'Operational and Analytical Big Data'.
As I understand it:
Operational Big Data: the branch where we perform read/write operations on big data using specially designed databases (NoSQL). Somewhat similar to ETL in an RDBMS.
Analytical Big Data: the branch where we analyse data in retrospect and draw predictions using techniques like MPP and MapReduce. Somewhat similar to reporting in an RDBMS.
(Please feel free to correct wherever I'm wrong, it's just my understanding.)
So, as I understand it, Hadoop is used for Analytical Big Data, where we just process data for analysis but don't tamper with the original data, and hence it is not an ideal choice for ETL. But recently I came across this article, which advocates using Hadoop for ETL: https://www.datanami.com/2014/09/01/five-steps-to-running-etl-on-hadoop-for-web-companies/
I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days.
I tried putting them in a table and doing it in batches of 100. Four days later, this is still running, with only 297,268 rows deleted. (I select 100 IDs from an ID table, delete rows WHERE id IN that list, then delete those 100 from the IDs table.)
I tried:
DELETE FROM tbl WHERE id IN (SELECT * FROM ids)
That's taking forever, too. It's hard to gauge how long, since I can't see its progress until it's done, but the query was still running after 2 days.
I'm just looking for the most effective way to delete from a table when I know the specific IDs to delete, and there are millions of them.
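A hedged sketch of one pattern that is often much faster than row-at-a-time batches: a single join-based delete, here issued via psycopg2. The tbl and ids names come from the question; the connection string is a placeholder.

# Sketch: delete by joining against the ids table in one statement,
# assuming PostgreSQL and the tbl/ids names from the question.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
with conn, conn.cursor() as cur:
    # A join-based delete lets the planner hash the id list once
    # instead of re-scanning it for every batch.
    cur.execute("DELETE FROM tbl USING ids WHERE tbl.id = ids.id")

If this is still slow, the usual culprits are per-row trigger or foreign-key checks on the target table rather than the delete itself.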
Once the data reaches millions of rows, computation becomes expensive. I want to compute in real time, so I don't want to use MapReduce.
How can I quickly count the number of rows?
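If avoiding MapReduce is the goal, one common in-memory alternative is Spark. A minimal sketch of a distributed row count with PySpark, assuming the data sits at a hypothetical Parquet path:

# Sketch: counting rows without MapReduce, using Spark's in-memory engine.
# The Parquet path is a placeholder; any Spark-readable source works.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row-count").getOrCreate()
df = spark.read.parquet("hdfs:///data/events")  # hypothetical path
print(df.count())  # distributed count, executed in memory
spark.stop()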
I am a relatively new user to Hadoop (using version 2.4.1). I installed Hadoop on my first node without a hitch, but I can't seem to get the ResourceManager to start on my second node.

I cleared up some "shared library" problems by adding this to yarn-env.sh and hadoop-env.sh:

export HADOOP_HOME="/usr/local/hadoop"
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

I also added this to hadoop-env.sh, based on the advice of this post at Hortonworks (http://hortonworks.com/community/forums/topic/hdfs-tmp-dir-issue/):

export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native

That cleared up all of my error messages; when I run /sbin/start-yarn.sh I get this:

starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-HdNode.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-HdNode.out

The only problem is, jps says that the ResourceManager isn't running. What's going on here?
Can anybody explain Apache Flume to me in plain language? I'd appreciate an explanation with a practical example instead of abstract theoretical definitions, so I can understand it better.
What is it used for? At which stage of a big data analysis is it used?
And what are the prerequisites for learning it?
Please explain it as you would for a non-technical person.
I'm curious if anyone can point to some successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data.
I would be interested to see any existing libraries dealing with scalable ETL solutions. Ideally these would be capable of ingesting 1-5 petabytes of data containing 50 billion records from 100 inhomogeneous data sets in tens or hundreds of hours, running on 4196 cores (256 i2.8xlarge AWS machines). I really do mean ideally, as I would be interested to hear about a system with 10% of this functionality to help reduce our team's ETL load.
Otherwise, I would be interested to see any books or review articles on the subject or high quality research papers. I have done a literature review and have only found lower quality conference proceedings with dubious claims.
I've seen a few commercial products advertised, but again, these make dubious claims without much evidence of their efficacy.
The datasets are rectangular and can take the form of fixed...
I have a multinode Hadoop cluster set up with two nodes (one master node and one slave node), each with 8 GB RAM.
I have also configured Hive on the master node. Everything is up and working.
NodeManager and DataNode are running on the slave node.
ResourceManager, NameNode, and SecondaryNameNode are running on the master node.
I am able to access the Hive terminal as well, but I am not able to drop a database with the drop database databaseName; command. It shows no error, but it has been stuck for more than an hour. Three tables have size 10000 × 20. I thought these might be causing the speed issues, so I wanted to delete the database, but since the drop database command doesn't work, is there any way to do it directly by deleting files?
I have tried to access the directory given by hive.metastore.warehouse.dir to delete the database directly, but this directory is completely empty.
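One thing worth ruling out, as a hedged sketch only: if the database still contains tables, Hive requires CASCADE to drop it. The statement below can also be typed directly at the Hive prompt; here it is issued via PyHive, with the host name as a placeholder and databaseName taken from the question (this assumes HiveServer2 is running).

# Sketch: dropping a non-empty Hive database with CASCADE via PyHive.
# Host and port are placeholders; databaseName is from the question.
from pyhive import hive

conn = hive.connect(host="master-node", port=10000)
cur = conn.cursor()
# CASCADE drops the database together with all of its tables.
cur.execute("DROP DATABASE IF EXISTS databaseName CASCADE")
cur.close()
conn.close()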
Similar slow behavior can be observed with other Hive commands as well. I am just able to run one...
Could someone please let me know what is meant by 'Big Data implementation over Cloud'?
I have been using Amazon S3 to store data and querying it with Hive, which I read is one such cloud implementation. I would like to know what exactly this means and all the possible ways to implement it.
Thanks