I have created a DocumentTermMatrix that contains 1859 documents (rows) and 25722 terms (columns). In order to perform further calculations on this matrix I need to convert it to a regular matrix. I want to use the as.matrix() command. However, it returns the following error: cannot allocate vector of size 364.8 MB.

> corp
A corpus with 1859 text documents
> mat <- DocumentTermMatrix(corp)
> dim(mat)
[1]  1859 25722
> is(mat)
[1] "DocumentTermMatrix"
> mat2 <- as.matrix(mat)
Error: cannot allocate vector of size 364.8 MB
> object.size(mat)
5502000 bytes

For some reason the size of the object seems to increase dramatically whenever it is transformed to a regular matrix. How can I avoid this? Or is there an alternative way to perform regular matrix operations on a DocumentTermMatrix?
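(A sketch of one commonly suggested alternative, not from the original post: keep the data sparse rather than densifying it. tm's DocumentTermMatrix is a slam simple_triplet_matrix, so its i/j/v slots can be handed to the Matrix package; the row-sum line is just an illustrative operation.)

library(Matrix)

sparse_mat <- sparseMatrix(i = mat$i, j = mat$j, x = mat$v,
                           dims = c(mat$nrow, mat$ncol),
                           dimnames = mat$dimnames)
doc_lengths <- rowSums(sparse_mat)  # works without allocating the dense matrix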
I'm working on some IoT integrations and I am wondering where in Azure I can parse my IoT data (JSON data).

My workflow so far has been: sensor pushes data -> IoT Hub -> Stream Analytics job -> SQL database. The Stream Analytics job works fine, but I have heard that it is not the "right" way to parse data in Azure. So what is the right and best way to do that? I need to save the data to a SQL database.
I can perform an ADF test on a vector:
library(tseries)
ht <- adf.test(vector, alternative="stationary", k=0)
but I am having trouble performing it on the columns of a data.frame:
ht <- adf.test(dataframe, alternative="stationary", k=0)
Is there a way of doing this?
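(A hedged sketch of one way to do this, my addition; it assumes every column of the data frame is numeric, and the p-value extraction is just an example:)

library(tseries)
results <- lapply(dataframe, adf.test, alternative = "stationary", k = 0)
p_values <- sapply(results, function(r) r$p.value)  # one p-value per column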
I have this only in my namenode:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
In my data nodes, I have this:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Now my question is, will the replication factor be 3 or 1?
At the moment, the output of hdfs dfs -ls hdfs:///user/hadoop-user/data/0/0/0 shows a replication factor of 1:
-rw-r--r-- 1 hadoop-user supergroup 68313 2015-11-06 19:32 hdfs:///user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin
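(For reference, my addition rather than part of the question: dfs.replication is a client-side setting, so the value that takes effect is the one configured on the node whose HDFS client writes the file. Two standard commands to inspect and change the replication of an existing file; the path is the one from the listing above:)

hdfs dfs -stat %r hdfs:///user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin
hdfs dfs -setrep -w 3 hdfs:///user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin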
I'm new to time series and used the monthly ozone concentration data from Rob Hyndman's website to do some forecasting. After doing a log transformation and differencing by lags 1 and 12 to get rid of the trend and seasonality respectively, I plotted the ACF and PACF (plots omitted here). Am I on the right track, and how would I interpret this as a SARIMA? There seems to be a pattern every 11 lags in the PACF plot, which makes me think I should do more differencing (at 11 lags), but doing so gives me a worse plot. I'd really appreciate any of your help!

EDIT: I got rid of the differencing at lag 1 and just used lag 12 instead (updated ACF and PACF plots omitted). From there, I deduced that SARIMA(1,0,1)x(1,1,1) (AIC: 520.098) or SARIMA(1,0,1)x(2,1,1) (AIC: 521.250) would be a good fit, but auto.arima gave me (3,1,1)x(2,0,0) (AIC: 560.7) normally and (1,1,1)x(2,0,0) (AIC: 558.09) without stepwise and approximation. I am confused about which model to use, but based on the lowest AIC, SARIMA(1,0,1)x(1,1,1) would be the...
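(A minimal sketch of fitting the leading candidate in R, my addition; `ozone` stands in for the poster's monthly series:)

fit <- arima(log(ozone), order = c(1, 0, 1),
             seasonal = list(order = c(1, 1, 1), period = 12))
fit$aic  # compare against the other candidate models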
I have a multinode Hadoop cluster set up with two nodes (one master node and one slave node), each with 8 GB RAM.
I have also configured hive on the master node. Everything is up and working.
Nodemanager and Datanode are working on the slave node.
ResourceManager, Namenode, and SecondaryNamenode are also working on the master node.
I am able to access the Hive terminal as well, but I am not able to drop the database through the drop database databaseName; command. It does not show any error but has been stuck for more than an hour. Three tables have size 10000 x 20; I thought these might be causing the speed issues, so I wanted to delete the database, but I am not able to delete it via the drop database command. Is there any way to do it directly by deleting the underlying files?
I have tried to access hive.metastore.warehouse.dir to delete the database directly, but this directory is completely empty.
Similar slow behavior can be observed with other Hive commands as well. I am just able to run one...
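(A couple of things worth checking from the Hive shell, my suggestions rather than anything from the post; a plain DROP DATABASE refuses a non-empty database, and the warehouse location may not be the default:)

DESCRIBE DATABASE databaseName;      -- shows the actual HDFS location of the database
DROP DATABASE databaseName CASCADE;  -- drops the database together with its tables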
It's been quite a while since I did any statistics, so I am struggling with the definitions of a Poisson distribution. What I understand by "the rate is constant" is that if a customer purchases 1 thing on average in a week, they purchase 4 things on average in a four-week period. Is this correct?

Where I believe I am confused is with the final sentence. Is this saying that the time between a customer's purchases would grow exponentially as time goes on? Doesn't this contradict the idea that we have a constant rate of purchase?
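(A short worked version of the two claims, my addition using standard Poisson-process facts: with rate lambda = 1 purchase per week, the count over t weeks is Poisson with mean lambda*t, and the gap between purchases is exponentially distributed with a constant mean. "Exponential" describes the shape of the waiting-time distribution, not gaps that grow over time.)

P(N_t = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}, \qquad E[N_4] = 4\lambda = 4

P(T > t) = e^{-\lambda t}, \qquad E[T] = \frac{1}{\lambda} = 1 \text{ week}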
I have been using the introductory example of matrix multiplication in TensorFlow:

matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.], [2.]])
product = tf.matmul(matrix1, matrix2)

When I print the product, it is displayed as a Tensor object. But how do I know the value of product? The following doesn't help:

print product
# Tensor("MatMul:0", shape=TensorShape([Dimension(1), Dimension(1)]), dtype=float32)

I know that graphs run in Sessions, but isn't there any way I can check the output of a Tensor object without running the graph in a session?
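(For context, my addition showing the standard TF 1.x behavior: the value only materializes when the graph is executed:)

import tensorflow as tf

matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.], [2.]])
product = tf.matmul(matrix1, matrix2)

with tf.Session() as sess:
    print(sess.run(product))  # [[ 12.]]
# equivalently, inside a default session: product.eval()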
I am confused about the difference between batch and growing batch Q-learning. Also, if I only have historical data, can I implement growing batch Q-learning? Thank you!
What does it mean to "unroll an RNN dynamically"? I've seen this specifically mentioned in the TensorFlow source code, but I'm looking for a conceptual explanation that extends to RNNs in general.

In the TensorFlow rnn method, it is documented:

"If the sequence_length vector is provided, dynamic calculation is performed. This method of calculation does not compute the RNN steps past the maximum sequence length of the minibatch (thus saving computational time)."

But the dynamic_rnn method mentions:

"The parameter sequence_length is optional and is used to copy-through state and zero-out outputs when past a batch element's sequence length. So it's more for correctness than performance, unlike in rnn()."

So does this mean rnn is more performant for variable-length sequences? What is the conceptual difference between dynamic_rnn and rnn?
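(A hedged sketch of the two call styles in TF 1.x, my addition; the shapes are illustrative. Static unrolling adds a copy of the cell's ops to the graph for every time step, while dynamic unrolling builds a single loop op:)

import tensorflow as tf

cell = tf.contrib.rnn.BasicRNNCell(64)
inputs = tf.placeholder(tf.float32, [None, 50, 10])  # batch x time x features
seq_len = tf.placeholder(tf.int32, [None])           # true length of each example

# dynamic unrolling: one tf.while_loop handles all time steps
outputs, state = tf.nn.dynamic_rnn(cell, inputs,
                                   sequence_length=seq_len, dtype=tf.float32)

# the static variant instead takes a Python list of per-step tensors and
# would need its own cell/scope if built in the same graph, e.g.:
#   steps = tf.unstack(inputs, axis=1)
#   outputs2, state2 = tf.nn.static_rnn(cell2, steps, dtype=tf.float32)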
I'm trying to write a simple RNN in TensorFlow, based on the tutorial here: https://danijar.com/introduction-to-recurrent-networks-in-tensorflow/ (I'm using a simple RNN cell rather than GRU, and not using dropout). I'm confused because the different RNN cells in my sequence appear to be assigned separate weights. If I run the following code

import tensorflow as tf

seq_length = 3
n_h = 100  # Number of hidden units
n_x = 26   # Size of input layer
n_y = 26   # Size of output layer

inputs = tf.placeholder(tf.float32, [None, seq_length, n_x])  # shape assumed
cells = []
for _ in range(seq_length):
    cell = tf.contrib.rnn.BasicRNNCell(n_h)
    cells.append(cell)
multi_rnn_cell = tf.contrib.rnn.MultiRNNCell(cells)
initial_state = tf.placeholder(tf.float32, [None, n_h])  # shape assumed
outputs_h, output_final_state = tf.nn.dynamic_rnn(multi_rnn_cell, inputs, dtype=tf.float32)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
print('Trainable variables:')
for v in tf.trainable_variables():
    print(v)

If I run this in Python 3, I get the following output (variable listing omitted here):

Trainable variables:
...

Firstly,...
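(My reading of why this happens, not from the post: MultiRNNCell stacks cells as layers, not as time steps, so three cells means a three-layer network with three separate weight sets; sharing across time steps is what dynamic_rnn already does with a single cell. A one-layer sketch:)

cell = tf.contrib.rnn.BasicRNNCell(n_h)  # one cell, reused at every time step
outputs_h, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)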
I'm building an RNN model to do image classification. I used a pipeline to feed in the data. However it returns

ValueError: Variable rnn/rnn/basic_rnn_cell/weights already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:

I wonder what I can do to fix this, since there are not many examples of implementing an RNN with an input pipeline. I know it would work if I used a placeholder, but my data is already in the form of tensors. Unless I can feed the placeholder with tensors, I prefer just to use the pipeline.

def RNN(inputs):
    with tf.variable_scope('cells', reuse=True):
        basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=batch_size)
    with tf.variable_scope('rnn'):
        outputs, states = tf.nn.dynamic_rnn(basic_cell, inputs, dtype=tf.float32)
    fc_drop = tf.nn.dropout(states, keep_prob)
    logits = tf.contrib.layers.fully_connected(fc_drop, batch_size, activation_fn=None)
    return logits

# Training
with tf.name_scope("cost_function") as scope:
    cost = ...
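(A common fix for this class of error, offered as an assumption rather than a verified diagnosis of the poster's code: make sure the RNN graph is built only once, or build it inside a scope that explicitly allows reuse. `n_hidden` is a placeholder name here:)

with tf.variable_scope('rnn', reuse=tf.AUTO_REUSE):  # TF 1.4+
    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_hidden)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, inputs, dtype=tf.float32)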
We are only using the RNN decoder (without the encoder) for text generation. How is the RNN decoder different from a pure RNN operation?

RNN decoder in TensorFlow: https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/dynamic_rnn_decoder
Pure RNN in TensorFlow: https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn

Thanks for your time.
I want to get started on HMMs, but don't know how to go about it. Can people here give me some basic pointers on where to look?

More than just the theory, I like to do a lot of hands-on work. So I would prefer resources where I can write small code snippets to check my learning, rather than just dry text.
I had a tough evening today trying to convince one of my colleagues that NLP, or Natural Language Processing, is the superset and Text Analytics is a subset of it. At best, the two are synonymous and can be used interchangeably.

Is that correct? Does anybody have crystal clarity as to whether these terms have a well-defined boundary, or whether they can be used interchangeably?
It seems like R is really designed to handle datasets that it can pull entirely into memory. What R packages are recommended for signal processing and machine learning on very large datasets that cannot be pulled into memory?
If R is simply the wrong way to do this, I am open to other robust free suggestions (e.g. scipy if there is some nice way to handle very large datasets)
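(One illustration of the out-of-memory style being asked about; the package choice and file name are mine, not the poster's. bigmemory memory-maps the data on disk instead of loading it into RAM:)

library(bigmemory)
library(biganalytics)

# file-backed matrix: the CSV is parsed once, then accessed via a memory map
X <- read.big.matrix("big.csv", header = TRUE, type = "double",
                     backingfile = "big.bin", descriptorfile = "big.desc")
colmean(X)  # summary statistics without holding the matrix in memory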
I'm trying to do a little bit of distribution plotting and fitting in Python using SciPy for stats and matplotlib for the plotting. I'm having good luck with some things, like creating a histogram:
import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt
from numpy.random import seed

seed(2)
alpha = 5
loc = 100
beta = 22
data = ss.gamma.rvs(alpha, loc=loc, scale=beta, size=5000)
myHist = plt.hist(data, 100, normed=True)
Brilliant!
I can even take the same gamma parameters and plot the probability density function as a line (after some googling):
rv = ss.gamma(5,100,22)
x = np.linspace(0,600)
h = plt.plot(x, rv.pdf(x))
How would I go about plotting the histogram myHist with the PDF line h superimposed on top of the histogram? I'm hoping this is trivial, but I have been unable to figure it out.
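(Combining the two pieces above with the same parameters; the arrangement is mine. Because the histogram is density-normalized, the PDF lands on the same vertical scale:)

import numpy as np
import scipy.stats as ss
import matplotlib.pyplot as plt

np.random.seed(2)
data = ss.gamma.rvs(5, loc=100, scale=22, size=5000)

plt.hist(data, 100, normed=True)           # density-normalized histogram
x = np.linspace(data.min(), data.max(), 200)
plt.plot(x, ss.gamma(5, 100, 22).pdf(x))   # PDF on the same axes
plt.show()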
I'm trying to follow some of the best practices of the "open science" movement. In my thesis, I've performed all of the analyses in R (a non-proprietary, open-source program for analyzing data), and my datasets are in the non-proprietary CSV format.
I would like to be as transparent as possible, by sharing my datasets and R analysis/code files with my thesis committee, and ultimately with the public once my thesis is finalized and placed in a repository. How can I best do this?
I was thinking about uploading my files to the Open Science Framework (http://osf.io) and citing them with a regular HTTPS link. Once my thesis is finalized, I would then "freeze" them on the OSF website (as I understand, this would prevent post-hoc changes), then get a DOI that points to the frozen files and cite that.
Are there any better options?
I am a learner in big data concepts. Based on my understanding, big data is critical in handling unstructured data and high volume. When we look at the big data architecture for a data warehouse (DW), the data from the source is extracted through Hadoop (HDFS and MapReduce), the relevant unstructured information is converted into valid business information, and finally the data is injected into the DW or data mart through ETL processing (along with the existing structured data processing).

However, I would like to know what new techniques, dimensional models, or storage requirements are needed at the DW for an organization (due to big data), as most of the tutorials/resources I try to learn from only talk about Hadoop at the source but not at the target. How does the introduction of big data impact the predefined reports/ad-hoc analysis of an organization, given this high volume of data?

Appreciate your response.
I have got a big data file loaded in Spark but wish to work on a small portion of it to run the analysis. Is there any way to do that? I tried doing repartition but it brings a lot of reshuffling. Is there any good way of processing only a small chunk of a big file loaded in Spark?
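(A hedged sketch of the usual approach, my addition; the names and sampling fraction are illustrative. Sampling avoids the shuffle that repartition triggers:)

# PySpark
small_df = df.sample(withReplacement=False, fraction=0.01, seed=42)
small_df.cache()  # keep the sample in memory for repeated analysis

# or for a plain RDD:
small_rdd = rdd.sample(False, 0.01, 42)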
I am currently working as a data scientist. I am planning to appear for a few data science interviews in the distant future, and I am aiming to have in-depth statistical knowledge (on par with statistics grads) as well as machine learning knowledge. Can you suggest the best books/videos to prepare myself?
I'm making a PUT request in order to upload data to Google Storage. But I'd like to upload big data, files around 2 GB or so, and I'd like to make a multi-part request; I mean, to upload an object in smaller parts, and my application doesn't do that so far. Does anyone know if this is possible using the PUT method? As I saw in Google Cloud's documentation, they use the POST method: https://cloud.google.com/storage/docs/json_api/v1/how-tos/upload
But I'd like to use the PUT method instead.
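(For what it's worth, my sketch of the GCS resumable-upload protocol as I understand it from the documentation; the bucket, object name, sizes, and token are placeholders. Only the session initiation is a POST; the data chunks themselves are then sent with PUT requests, which may be what you're after:)

import requests

# 1) initiate a resumable session (POST, no file data yet)
init = requests.post(
    "https://www.googleapis.com/upload/storage/v1/b/my-bucket/o"
    "?uploadType=resumable&name=big-file.bin",
    headers={"Authorization": "Bearer ACCESS_TOKEN",
             "X-Upload-Content-Type": "application/octet-stream"})
session_uri = init.headers["Location"]

# 2) send each chunk with PUT, declaring its byte range within the total size
with open("big-file.bin", "rb") as f:
    chunk = f.read(8 * 1024 * 1024)  # 8 MB, a multiple of 256 KB as required
requests.put(session_uri, data=chunk,
             headers={"Content-Range": "bytes 0-8388607/2147483648"})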