When I use Apache Flume, I get a millisecond timestamp rather than a second timestamp. This is my Flume conf file:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = org.apache.flume.source.http.HTTPSource
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = flume/ads/%y-%m-%d/%H
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Flume creates the folder flume/ads/70-01-17/02. The folder contains files named "FlumeData.timestamp", and this timestamp has twelve digits. I get an incorrect folder name. What can I do?
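For context on the question above: with the HDFS sink, the %y-%m-%d/%H escapes are resolved from the event's timestamp header, interpreted as epoch milliseconds; a header that is missing or written in seconds lands the path in 1970 (hence 70-01-17). A hedged sketch of two commonly used fixes, assuming the HTTP events carry no usable timestamp header:

```
# Either let the sink stamp events with the local time...
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# ...or add a timestamp interceptor on the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
```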
I am using Hadoop 2.6.0, and now I am trying sqoop-1.4.5.bin__hadoop-2.0.4-alpha.tar.gz. I can get the Sqoop version using:

sqoop version
2016-10-19 16:11:21,722 - INFO - Running Sqoop version: 1.4.5
Sqoop 1.4.5

But if I try any Sqoop command, it gives the following exception:

sqoop list-tables --connect jdbc:mysql://localhost/test --username root --password hadoop
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.addDeprecations([Lorg/apache/hadoop/conf/Configuration$DeprecationDelta;)V

I copied the MySQL connector jar to sqoop/lib as well. I am not able to find the cause. If anybody has any idea, please share how to solve this.
I am a newbie in the hadoop framework, so it would help me if someone could guide me through this. I have two types of files:

dirA/ --> file_a, file_b, file_c
dirB/ --> another_file_a, another_file_b...

Files in directory A contain transaction information, something like:

id, time_stamp
1, some_time_stamp
2, some_another_time_stamp
1, another_time_stamp

This kind of information is scattered across all the files in dirA. The first thing to do is: I give a time frame (let's say last week) and I want to find all the unique ids which are present within that time frame, and save them to a file.

The dirB files contain the address information, something like:

id, address, zip code
1, fooadd, 12345

and so on. I take all the unique ids output by the first step as input and then find the address and zip code. Basically the final output is like a SQL merge: find all the unique ids within a time frame and then merge in the address information.

I would greatly appreciate any help. Thanks
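The two steps described above can be sketched in plain Python before writing the MapReduce version; the record values here are made-up placeholders, and in Hadoop this would typically be a filter job followed by a reduce-side join:

```python
# Step 1: collect the unique ids whose timestamp falls inside the time frame.
transactions = [(1, 100), (2, 250), (1, 300)]  # (id, time_stamp) rows from dirA
window = (200, 400)
wanted = {tid for tid, ts in transactions if window[0] <= ts <= window[1]}

# Step 2: join those ids against the address records from dirB,
# like a SQL inner join on id.
addresses = {1: ("fooadd", "12345"), 2: ("baradd", "67890"), 3: ("bazadd", "11111")}
merged = {tid: addresses[tid] for tid in wanted if tid in addresses}
print(merged)  # ids 1 and 2 with their address and zip code
```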
Is there a way to locate a specific file in hadoop?
I know, that I can use this: hadoop fs -find /some_directory
But, is there a command like this: hadoop locate some_file_name?
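There is no locate-style index in HDFS, but `hadoop fs -find` (available since Hadoop 2.7) accepts a -name expression, so searching from the root comes close; the file name below is a placeholder:

```shell
hadoop fs -find / -name some_file_name
```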
I'd like to apply qdap's polarity function to a vector of documents, each of which could contain multiple sentences, and obtain the corresponding polarity for each document. For example:

library(qdap)
polarity(DATA$state)$all$polarity
# Results: -0.8165 -0.4082 0.0000 -0.8944 0.0000 0.0000 0.0000 -0.5774 0.0000 0.4082 0.0000
# Warning message:
# In polarity(DATA$state) : Some rows contain double punctuation. Suggested use of `sentSplit` function.

This warning can't be ignored, as it seems to add the polarity scores of each sentence in the document. This can result in document-level polarity scores outside the bounds. I'm aware of the option to first run sentSplit and then average across the sentences, perhaps weighting polarity by word count, but this is (1) inefficient (takes roughly 4x as long as running on the full documents with the warning), and (2) unclear how to weight sentences. That option would look something like this:

DATA$id <- seq(nrow(DATA))  # For identifying and aggregating documents
...
Let's assume I have some data I obtained empirically:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?

I can fit a model with:
param = stats.expon.fit(x)
import matplotlib.pyplot as plt
plt.hist(x, density=True, color='white', hatch='/')
grid = np.linspace(0, 100, 10000)
plt.plot(grid, stats.expon.pdf(grid, *param))
Calculating the Kolmogorov-Smirnov test is very elegant:
>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)
However, I can't find a good way of calculating the chi-squared test. There is a chi-squared GoF function in statsmodels, but it assumes a discrete distribution (and the exponential distribution is...
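A minimal sketch of a chi-squared GoF test against the fitted exponential, binning with equal expected probability per bin; the bin count k and the ddof for the two estimated parameters are choices on my part, not the only valid ones:

```python
import numpy as np
from scipy import stats

size = 10000
x = 10 * stats.expon.rvs(size=size, random_state=0)

param = stats.expon.fit(x)  # (loc, scale)

# Bin edges chosen so each bin has equal expected probability under the fit.
k = 20
edges = stats.expon.ppf(np.linspace(0, 1, k + 1), *param)
observed, _ = np.histogram(x, bins=edges)
expected = np.full(k, size / k)

# ddof=2 accounts for the two parameters estimated from the data.
chi2_stat, p_value = stats.chisquare(observed, expected, ddof=2)
```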
How can I make a dashboard in QlikView, showing variances for some results?
The final screen should:
Show the results.
Show up-or-down arrow for every result.
I'm pretty sure it is possible, as a Google image search (for the keyword 'qlikview') shows such a dashboard (I highlighted those arrows with a black rectangle):
I often get the problem that this or that alias name is already used somewhere, and I can't easily find the variable or aggregation to release the name.
Is there some place in Tableau where I can view/edit/reset full list of aliases?
I have applied simple forecasting models such as Naive Forecast, Moving Average, Simple Exponential Smoothing, and Holt's Linear Trend Model to the 2018 sales data of a salesperson.
All the models result in a prediction line that flattens at zero. Could it be an issue with the data, since most of the data is flat at zero?
model = ARIMA(train_log, order=(0, 1, 2))
output = model.fit(disp=-1)
# Convert fitted values into a series
output_series = pd.Series(output.fittedvalues, copy=True)
print(output_series.head())
# Cumulative sum to undo the order-1 differencing
output_series_cumsum = output_series.cumsum()
print(output_series_cumsum.head())
# Add back the first observation, then convert the predicted
# ARIMA values from log scale to the original format
output_log = train_log.iloc[0] + output_series_cumsum
convert_output = np.exp(output_log)
plt.title('RMSE: %.4f' % np.sqrt(np.mean((convert_output - np.exp(train_log)) ** 2)))
Date         Sales
----         -----
2018-01-27   1
2018-01-30   ...
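A flat forecast near zero is exactly what simple exponential smoothing gives on a mostly-zero series, since the smoothed level is dragged toward zero between the sparse sales; a minimal numpy sketch (alpha and the toy series are made up):

```python
import numpy as np

def simple_exp_smoothing(y, alpha):
    """Track the smoothed level after each observation; the last value is the flat forecast."""
    level = y[0]
    levels = [level]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
        levels.append(level)
    return np.array(levels)

# Mostly-zero sales, like the data described above.
sales = np.array([1.0, 0, 0, 0, 2.0, 0, 0, 0, 0, 0])
levels = simple_exp_smoothing(sales, alpha=0.2)
print(levels[-1])  # the level (and hence the forecast) has decayed toward zero
```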
I need to generate periodic (daily, monthly) web analytics dashboard reports. They will be static and don't require interaction, so imagine a PDF file as the target output. The reports will mix tables and charts (mainly sparkline and bullet graphs created with ggplot2). Think Stephen Few/Perceptual Edge style dashboards, such as:
but applied to web analytics.
Any suggestions on what packages to use creating these dashboard reports?
My first intuition is to use R Markdown and knitr, but perhaps you've found a better solution. I can't seem to find rich examples of dashboards generated from R.
I'm trying to predict age from a given picture. I built the model below, but the problem is that I'm getting a very large loss value with low accuracy while fitting the model. I think the problem is choosing the wrong loss function (here mean_squared_error). What can be the problem here?

import tensorflow as tf
from tensorflow import keras

X = X.reshape(-1, image_size, image_size, 1)
model = keras.models.Sequential()
model.add(keras.layers.Conv2D(32, (5, 5), activation='relu', input_shape=X.shape))
model.add(keras.layers.MaxPooling2D((2, 2)))
model.add(keras.layers.Conv2D(32, (3, 3), activation='relu'))
model.add(keras.layers.MaxPooling2D(2, 2))
model.add(keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(60, activation='relu'))
model.add(keras.layers.Dropout(0.4))
model.add(keras.layers.Dense(1, activation='softmax'))
model.compile(optimizer='adam', loss=keras.losses.mean_squared_error, metrics=)
model.fit(X, Y, epochs=170, shuffle=True, ...
In the IoT, alternatives to cloud computing architecture exist. These move part of the computation to the lower levels. Examples are edge computing and fog computing.
What are the differences between edge computing and fog computing?
I am new to Tableau and trying to understand how convenient it is to prepare a dashboard which can be accessed through web browsers, tablets and mobile phones. I have a few questions:
Does Tableau already provide a responsive dashboard which adjusts itself to the device width?
Can we customize the look and feel of the charts/dashboard to suit our requirements for browsers and tablets?
Is there a demo available where I can see how a Tableau dashboard looks on various screen widths?
Which mobile platforms does Tableau support (Android, iOS, Win)?
I am trying to plot a simple function in python ( x + sqrt(x^2 + 2x) ). Here is my code:
import pylab as pl
import numpy as np
import math
X = np.linspace(-999999,999999)
Y = (X+math.sqrt(X**2+2*X))
pl.plot(X,Y)
pl.show()
Here is the error that I am facing: TypeError: only length-1 arrays can be converted to Python scalars
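For what it's worth, the TypeError comes from math.sqrt, which only accepts scalars; a hedged sketch of the likely intent using the elementwise np.sqrt instead (restricted here to a domain where x^2 + 2x is non-negative):

```python
import numpy as np

X = np.linspace(1, 10, 50)       # x**2 + 2*x >= 0 on this range
Y = X + np.sqrt(X**2 + 2*X)      # np.sqrt works elementwise on arrays
print(Y[0])                      # 1 + sqrt(3)
```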
I have two masters degrees in pure mathematics and I am thinking of going for a data analyst job. I have a background in mathematics and statistics but I don't know any programming language (though I work with Mathematica, Matlab, etc.). Is it possible for me to go into data analysis? What do I need to learn?
I am using Apache Spark to perform sentiment analysis. I am using the Naive Bayes algorithm to classify the text. I don't know how to find out the probability of labels. I would be grateful for a snippet in Python that finds the probability of labels.
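Not Spark's API, but as an illustration of what "probability of labels" means for Naive Bayes, here is the posterior computation in plain numpy; the priors and likelihoods are made up:

```python
import numpy as np

# Toy multinomial Naive Bayes: 2 labels, 2 word-count features.
# Posterior: log P(label) + sum(count * log P(word | label)), then normalize.
log_prior = np.log(np.array([0.6, 0.4]))
log_likelihood = np.log(np.array([[0.7, 0.3],    # P(word | label 0)
                                  [0.2, 0.8]]))  # P(word | label 1)
counts = np.array([3, 1])  # a document with these word counts

log_post = log_prior + log_likelihood @ counts
probs = np.exp(log_post - log_post.max())  # subtract max for numerical stability
probs /= probs.sum()
print(probs)  # posterior probability of each label
```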
I have a large set of data (about 8GB). I would like to use machine learning to analyze it. So, I think that I should use SVD, then PCA, to reduce the data dimensionality for efficiency. However, MATLAB and Octave cannot load such a large dataset.
What tools can I use to do SVD with such a large amount of data?
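One standard trick when the rows don't fit in memory but the number of features is modest: accumulate X^T X and the column sums over row-chunks, then eigendecompose the covariance. A numpy sketch (the chunk source and sizes are placeholders):

```python
import numpy as np

def pca_out_of_core(chunks, n_features):
    """Accumulate X.T @ X over row-chunks, then eigendecompose for PCA."""
    gram = np.zeros((n_features, n_features))
    mean = np.zeros(n_features)
    n = 0
    for chunk in chunks:          # each chunk is a (rows, n_features) array
        gram += chunk.T @ chunk
        mean += chunk.sum(axis=0)
        n += len(chunk)
    mean /= n
    cov = gram / n - np.outer(mean, mean)   # covariance over all rows
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    return eigvals[::-1], eigvecs[:, ::-1]  # principal components first

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 5))
eigvals, eigvecs = pca_out_of_core(np.array_split(data, 10), 5)
```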
I am currently trying to open a file with pandas and Python for machine learning purposes; it would be ideal for me to have it all in a DataFrame. The file is 18 GB large and my RAM is 32 GB, but I keep getting memory errors.
From your experience, is it possible? If not, do you know of a better way to go around this? (A Hive table? Increase the size of my RAM to 64 GB? Create a database and access it from Python?)
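It often is possible without more RAM by streaming the file instead of loading it whole; a hedged sketch using pandas' chunksize (the demo reads a tiny in-memory CSV, since the real 18 GB path depends on your setup):

```python
import io
import pandas as pd

def count_rows_chunked(path_or_buf, chunksize=1_000_000):
    """Stream a CSV in chunks instead of loading it all into RAM."""
    total = 0
    for chunk in pd.read_csv(path_or_buf, chunksize=chunksize):
        total += len(chunk)  # replace with per-chunk filtering/aggregation
    return total

# Small in-memory demo; with a huge file you would pass its path instead.
csv = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
print(count_rows_chunked(csv, chunksize=2))  # → 3
```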
An aspiring data scientist here. I don't know anything about Hadoop, but as I have been reading about Data Science and Big Data, I see a lot of talk about Hadoop. Is it absolutely necessary to learn Hadoop to be a Data Scientist?
I have built a 3-layer neural network to perform a binary mapping (2016 inputs, 288 outputs). I am getting decent results with mean squared error and stochastic gradient descent. My question is: is there a more appropriate loss function for regression when the output is binary?
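For reference alongside the question, the usual alternative to mean squared error for binary targets is binary cross-entropy; a minimal numpy sketch of the loss itself (the example vectors are made up, and this is not the asker's network):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE over all outputs; y_pred are sigmoid probabilities in (0, 1)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
loss = binary_cross_entropy(y_true, y_pred)
print(loss)
```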
I have a problem involving a collection of continuous probability distribution functions, most of which are determined empirically (e.g. departure times, transit times). What I need is some way of taking two of these PDFs and doing arithmetic on them. E.g. if I have two values x taken from PDF X and y taken from PDF Y, I need to get the PDF for (x+y), or any other operation f(x,y).
An analytical solution is not possible, so what I'm looking for is some representation of PDFs that allows such things. An obvious (but computationally expensive) solution is Monte Carlo: generate lots of values of x and y, and then just measure f(x, y). But that takes too much CPU time.
I did think about representing the PDF as a list of ranges where each range has a roughly equal probability, effectively representing the PDF as the union of a list of uniform distributions. But I can't see how to combine them.
Does anyone have any good solutions to this problem?
Edit: The goal is to create a mini-language (aka Domain...
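The piecewise representation mentioned above can in fact be combined for addition: on a common grid, the PDF of x + y is the discrete convolution of the two probability-mass vectors. A small numpy sketch with made-up masses:

```python
import numpy as np

# Each PDF as probability masses on a shared grid with step 0.1.
px = np.array([0.2, 0.5, 0.3])   # P(X = 0.0), P(X = 0.1), P(X = 0.2)
py = np.array([0.6, 0.4])        # P(Y = 0.0), P(Y = 0.1)

# The distribution of X + Y is the convolution of the two mass vectors.
psum = np.convolve(px, py)
print(psum)        # masses at 0.0, 0.1, 0.2, 0.3
print(psum.sum())  # still sums to 1
```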
Having a lot of text documents (in natural language, unstructured), what are the possible ways of annotating them with some semantic meta-data? For example, consider a short document:

I saw the company's manager last day.

To be able to extract information from it, it must be annotated with additional data to be less ambiguous. The process of finding such meta-data is not in question, so assume it is done manually. The question is how to store these data in a way that further analysis on it can be done more conveniently/efficiently?
A possible approach is to use XML tags (see below), but it seems too verbose, and maybe there are better approaches/guidelines for storing such meta-data on text documents.

I saw the company's manager last day.
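For illustration only, a hypothetical inline-XML annotation of the example sentence; every tag and attribute name here is an assumption, not a standard:

```xml
<doc>
  I saw <entity type="person" role="manager" org="the company">the company's manager</entity>
  <time ref="yesterday">last day</time>.
</doc>
```

Standoff annotation (character offsets stored separately from the raw text) is a common alternative when inline markup like this gets too verbose.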
I am attempting to use the tm package to convert a vector of text strings to a corpus element.
My code looks something like this:
Corpus(d1$Yes)
where d1$Yes is a factor with 124 levels, each containing a text string.
For example, d1$Yes = "So we can get the boat out!"
I'm receiving the following error: "Error: inherits(x, "Source") is not TRUE"
I'm not sure how to remedy this.