Hi, I can't resolve my problem when running Hadoop with start-all.sh:
rochdi@127:~$ start-all.sh
/usr/local/hadoop/bin/hadoop-daemon.sh: line 62: [: localhost: integer expression expected
starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-rochdi-namenode-127.0.0.1
localhost: /usr/local/hadoop/bin/hadoop-daemon.sh: line 62: [: localhost: integer expression expected
localhost: starting datanode, logging to /usr/local/hadoop/libexec/../logs/hadoop-rochdi-datanode-127.0.0.1
localhost: /usr/local/hadoop/bin/hadoop-daemon.sh: line 62: [: localhost: integer expression expected
localhost: starting secondarynamenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-rochdi-secondarynamenode-127.0.0.1
/usr/local/hadoop/bin/hadoop-daemon.sh: line 62: [: localhost: integer expression expected
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-rochdi-jobtracker-127.0.0.1
localhost: /usr/local/hadoop/bin/hadoop-daemon.sh: line 62: [: localhost: integer expression expected
I am new to the Tableau world. I have traffic data in a .csv file containing latitude and longitude values. I have loaded the data into Tableau as a symbol map. I need to show the corresponding location on the map in Tableau. Can somebody suggest how to do this?
Another concern is that in Tableau maps there is no caption for any location, as we can see in Google Maps. Can I change a raw map to be more informative in terms of showing location data such as a city, an IT campus, etc.?
Thank you
I already have a cluster of 3 machines (ubuntu1, ubuntu2, ubuntu3, as VirtualBox VMs) running Hadoop 1.0.0, and I installed Spark on each of these machines. ubuntu1 is my master node and the other nodes work as slaves. My question is: what exactly is a Spark driver? Should we set an IP and port for the Spark driver via spark.driver.host, and where will it be executed and located (master or slave)?
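For context, here is a minimal PySpark sketch, not taken from the question, of where spark.driver.host would be set; the master URL, host address, and port below are placeholder assumptions:

from pyspark import SparkConf, SparkContext

# Placeholder values only: the standalone master URL and the driver address
# are assumptions for illustration, not values from the question.
conf = (
    SparkConf()
    .setAppName("driver-host-sketch")
    .setMaster("spark://ubuntu1:7077")          # assumed standalone master URL
    .set("spark.driver.host", "192.168.56.10")  # address executors use to reach the driver
    .set("spark.driver.port", "7078")           # optional fixed port for the driver
)

sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # the driver schedules this job on the workers
sc.stop()

In this sketch the driver is simply the Python process running the script, so spark.driver.host points at whichever machine launches it.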
I want to train a simple neural network with PyTorch on a pandas dataframe df.
One of the columns is named "Target", and it is the target variable of the network. How can I use this dataframe as input to the PyTorch network?
I tried this, but it doesn't work:
import pandas as pd
import torch.utils.data as data_utils
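Since the snippet above is cut off, here is a minimal sketch of one common way to wrap a dataframe with torch.utils.data; the toy df below is an assumption standing in for the real data:

import pandas as pd
import torch
import torch.utils.data as data_utils

# Assumed toy dataframe standing in for the real df from the question.
df = pd.DataFrame({"x1": [0.1, 0.2, 0.3, 0.4],
                   "x2": [1.0, 0.5, 0.2, 0.1],
                   "Target": [0, 1, 0, 1]})

features = torch.tensor(df.drop(columns=["Target"]).values, dtype=torch.float32)
targets = torch.tensor(df["Target"].values, dtype=torch.float32)

dataset = data_utils.TensorDataset(features, targets)
loader = data_utils.DataLoader(dataset, batch_size=2, shuffle=True)

for xb, yb in loader:
    print(xb.shape, yb.shape)  # batches ready to feed into a network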
I've recently started getting into data analysis and I've learned quite a bit over the last year (at the moment, pretty much exclusively using Python). I feel the next step is to begin training myself in MapReduce/Hadoop. I have no formal computer science training, however, and so I often don't quite understand the jargon that is used when people write about Hadoop, hence my question here.
What I am hoping for is a top level overview of Hadoop (unless there is something else I should be using?) and perhaps a recommendation for some sort of tutorial/text book.
If, for example, I want to parallelise a neural network which I have written in Python, where would I start? Is there a relatively standard method for implementing Hadoop with an algorithm or is each solution very problem specific?
The Apache wiki page describes Hadoop as "a framework for running applications on large cluster built of commodity hardware". But what does that mean? I've heard the term "Hadoop Cluster" and I know that Hadoop is Java...
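As a rough illustration of the "standard method" question: with Hadoop Streaming, any Python program that reads lines on stdin and writes key/value lines on stdout can act as a mapper or reducer. The word-count sketch below is purely illustrative and not tied to the question:

from itertools import groupby

def mapper(lines):
    # Emit (word, 1) for every word: the "map" half of word count.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reducer sees it,
    # so equal words arrive grouped together.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local sanity check without a cluster.
    text = ["hadoop maps then reduces", "hadoop scales out"]
    print(dict(reducer(sorted(mapper(text)))))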
Hi, at university in the data science area we learned that if we want to work with small data we should use pandas, and if we work with Big Data we should use Spark (in the case of Python programmers, PySpark).
Recently, in a hackathon in the cloud (Azure Synapse, which runs on Spark), I saw pandas being imported in the notebook (I suppose the code is good because it was written by Microsoft people):
import pandas
from azureml.core import Dataset
training_pd = training_data.toPandas().to_csv('training_pd.csv', index=False)
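For reference, a small sketch (variable names assumed, not from the hackathon notebook) of the boundary between the two libraries: toPandas() collects a distributed Spark DataFrame into a local pandas DataFrame on the driver, so it only makes sense when the result fits in memory:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-vs-spark").getOrCreate()

# A distributed Spark DataFrame built from toy data.
sdf = spark.createDataFrame(pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}))

# Pulls all rows onto the driver as an ordinary in-memory pandas DataFrame.
pdf = sdf.toPandas()
pdf.to_csv("training_pd.csv", index=False)  # same pattern as the snippet above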
I have the option of using Sqoop or Informatica Big Data Edition to source data into HDFS. The source systems are Teradata and Oracle.
I would like to know which one is better, and the reasons behind it.
Note: My current utility is able to pull data into HDFS using Sqoop, create a Hive staging table, and archive an external table.
Informatica is the ETL tool used in the organization.
Regards Sanjeeb
I copied and pasted TensorFlow's official "Basic classification: Classify images of clothing" code (https://www.tensorflow.org/tutorials/keras/classification):
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
and ran it. Upon running it printed a load of gibberish and wouldn't stop (almost like when you accidentally put a print in a while loop):
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
so I terminated it. The above is just a VERY small portion of what printed. I ran it again, only to get an error straight away.
line 7, in <module>
(train_images, train_labels), (test_images,...
I'm learning object-oriented programming in a data science context.
I want to understand what good practice is in terms of writing methods within a class that relate to one another.
When I run my code:
import pandas as pd

pd.options.mode.chained_assignment = None

class MyData:
    def __init__(self, file_path):
        self.file_path = file_path

    def prepper_fun(self):
        '''Reads in an Excel sheet, gets rid of missing values and sets datatype to numerical'''
        df = pd.read_excel(self.file_path)
        df = df.dropna()
        df = df.apply(pd.to_numeric)
        self.df = df
        return df

    def quality_fun(self):
        '''Checks if any value in any column is more than 10. If it is, the value is replaced
        with the warning 'check original data value'.'''
        for col in self.df.columns:
            for row in self.df.index:
                # Compare the individual cell, not the whole dataframe.
                if self.df.loc[row, col] > 10:
                    self.df.loc[row, col] = 'check original data value'
        return self.df
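A minimal usage sketch (the file name is a placeholder) of how the two methods relate: quality_fun depends on the self.df attribute that prepper_fun creates, so it only works after prepper_fun has run:

data = MyData("measurements.xlsx")  # placeholder path, not from the question
data.prepper_fun()                  # must run first: it sets data.df
checked = data.quality_fun()        # then operates on the stored dataframe
print(checked.head())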
Ops: This does not belong on ServerFault because it focuses on programming architecture.
I have the following questions regarding the differences between Cloud and Virtualization.
How is Cloud different from Virtualization?
I recently tried to find out the pricing of Rackspace, Amazon, and similar cloud providers, and I found that our current 6 dedicated servers come out cheaper than their pricing. So how can one claim the cloud is cheaper? Is it cheaper only in comparison to normal hosting?
We reorganized our infrastructure into a virtual environment to reduce our configuration overhead at time of failure, and we did not have to rewrite any piece of code that was already written for the earlier setup. So moving to virtualization does not require any reprogramming. But cloud is absolutely different and will require reprogramming everything, right?
Is it really worth recoding when our current IT costs are 3-4 times lower than cloud hosting, including RAID backups and all sorts of clustering for high availability?
I'm trying to implement a Lambda Architecture using the following tools: Apache Kafka to receive all the datapoints, Spark for batch processing (Big Data), Spark Streaming for real time (Fast Data), and Cassandra to store the results.
Also, all the datapoints I receive are related to a user session, and therefore, for the batch processing I'm only interested in processing the datapoints once a session finishes. Since I'm using Kafka, the only way to solve this (assuming that all the datapoints are stored in the same topic) is for the batch job to fetch all the messages in the topic and then ignore those that correspond to sessions that have not yet finished. A sketch of that idea is shown below.
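A minimal PySpark sketch of that batch step, using Spark's structured Kafka batch source (Spark 2.x+); the broker address, topic name, and the finished-session lookup are placeholder assumptions, not details from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lambda-batch-sketch").getOrCreate()

# Batch read of the entire topic (placeholder broker/topic names).
events = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "user-events")
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load()
    .select(F.col("key").cast("string").alias("session_id"),
            F.col("value").cast("string").alias("payload"))
)

# Assumed lookup of finished sessions, e.g. loaded from Cassandra.
finished = spark.createDataFrame([("s1",), ("s2",)], ["session_id"])

# Keep only datapoints whose session has already finished.
closed_events = events.join(finished, on="session_id", how="inner")
closed_events.show()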
So, what I'd like to ask is:
Is this a good approach to implement the Lambda Architecture? Or should I use Hadoop and Storm instead? (I can't find information about people using Kafka and Apache Spark for batch processing/MapReduce.)
Is there a better approach to solve the user sessions problem?
I have a bunch of client point of sale (POS) systems that periodically send new sales data to one centralized database, which stores the data into one big database for report generation.
The client POS is based on PHPPOS, and I have implemented a module that uses the standard XML-RPC library to send sales data to the service. The server system is built on CodeIgniter, and uses the XML-RPC and XML-RPCS libraries for the webservice component. Whenever I send a lot of sales data (as little as 50 rows from the sales table, and individual rows from sales_items pertaining to each item within the sale) I get the following error:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 54 bytes)
128M is the default value in php.ini, but I assume that is a huge amount to exhaust. In fact, I have even tried setting this value to 1024M, and all it does is take longer to error out.
As for steps I've taken, I've tried disabling all processing on the server-side, and have rigged it...
I'm very new to Tableau, and would like to know how to convert this SQL to Tableau:
select case when RD = 1 then 'RD'
            else
                case when Claim_FeatureStatus <> 'Re-opened'
                          and subro_only = 0
                          and SIU = 0
                     then 'Open'
                     when Claim_FeatureStatus = 'Re-opened'
                          and subro_only = 0
                          and SIU = 0
                     then 'Re-Opened'
                     when SIU = 1
                     then 'SIU'
                     else 'Subrogation'
                end
       end as ClaimStatus
I tried to install the XGBoost package in Python. I am using Windows OS, 64-bit. I have gone through the following.
The package directory states that xgboost is unstable for Windows and is disabled: "pip installation on windows is currently disabled for further investigation, please install from github." https://pypi.python.org/pypi/xgboost/
I am not well versed in Visual Studio and am facing problems building XGBoost. I am missing opportunities to utilize the xgboost package in data science.
Please guide me so that I can import the XGBoost package in Python.
Thanks
I'm trying to install a specific PyTorch version under a conda env.
Using pip:
pip3 install pytorch==1.0.1
WARNING: pip is being invoked by an old script wrapper. This will fail in a future version of pip.
Please see https://github.com/pypa/pip/issues/5599 for advice on fixing the underlying issue.
To avoid this problem you can invoke Python with '-m pip' instead of running pip directly.
Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement pytorch==1.0.1 (from versions: 0.1.2, 1.0.2)
ERROR: No matching distribution found for pytorch==1.0.1
Using Conda:
conda install pytorch==1.0.1
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
I have two questions:
(1) A question about importing some subpackages inside tensorflow.keras.
(2) How to differentiate between the packages installed by 'pip install' and 'conda install' (under Windows).
I am using Anaconda with tensorflow 2.0.0. I am trying to import a package like:
import tensorflow.keras.utils.np_utils
However, the error shows that:
---------------------------------------------------------------------------
> ModuleNotFoundError Traceback (most recent call
> last) <ipython-input-2-ee1bc59a14ab> in <module>
> ----> 1 import tensorflow.keras.utils.np_utils
>
> ModuleNotFoundError: No module named 'tensorflow.keras.utils.np_utils'
I am confused about why this is happening; I installed tensorflow with the command:
conda install tensorflow==2.0.0
from the Anaconda prompt.
Yes, I know Anaconda should already have all the data science packages inside it; the reason I uninstalled the tensorflow provided by Anaconda and reinstalled it was that before using...
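For what it's worth, a minimal sketch (assuming tensorflow 2.x) of how the helpers that used to live under keras.utils.np_utils are usually reached in tf.keras, where they are exposed directly on tensorflow.keras.utils rather than through an np_utils submodule:

import numpy as np
import tensorflow as tf

labels = np.array([0, 2, 1, 2])
# In tf.keras the helper lives at tf.keras.utils.to_categorical,
# not under a separate np_utils submodule.
one_hot = tf.keras.utils.to_categorical(labels, num_classes=3)
print(one_hot)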
I am learning assembly language in my spare time to become a better developer.
I understand the difference between stack-based machines and register-based machines at a conceptual level, but I am wondering how stack-based machines are actually implemented. If a virtual machine, e.g. the JVM or .NET, runs on a register-based architecture, e.g. x86 or x64, then it must use registers at the assembly level (as far as I can tell). I am obviously missing something here, and therefore I am unsure of the distinction at the assembly level.
I have read articles on here, e.g. "Stack-based machine depends on a register-based machine?", and also on Wikipedia, but I don't believe they answer my question directly.
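To make the distinction concrete, here is a toy sketch (entirely illustrative, not how the JVM actually works) of a stack-machine interpreter: the virtual machine's operand stack is an ordinary in-memory data structure, and the host CPU's registers are only used implicitly by whatever compiled code runs this loop:

# Toy stack-machine interpreter: computes (2 + 3) * 4.
# "PUSH"/"ADD"/"MUL" are made-up opcodes for illustration only.
program = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]

stack = []  # the VM's operand stack lives in plain memory
for op, *args in program:
    if op == "PUSH":
        stack.append(args[0])
    elif op == "ADD":
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)
    elif op == "MUL":
        b, a = stack.pop(), stack.pop()
        stack.append(a * b)

print(stack.pop())  # -> 20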
I have a list of strings containing arbitrary phone numbers in Python. The extension is an optional part.
st =
My objective is to segregate the phone numbers so that I can isolate each individual group viz. '800', '555', '1212' and the optional '1234'.
I have tried out the following code.
p1 = re.compile(r'(\d{3}).*(\d{3}).*(\d{4}).*(\d{4})?')
step1 =
p2 = re.compile(r'(\d{3})(\d{3})(\d{4})(\d{4})?')
step2 =
p1 and p2 being the two patterns to fetch the desired output.
for i in range(len(step2)):
print step2
Since I am a newbie, I would like suggestions on whether there are better ways to tackle such problems, or best practices followed in the Python community. Thanks in advance.
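As an aside, a minimal sketch of one way to capture the three groups plus the optional extension with a single pattern; the sample number is my own placeholder, since the original list st is not shown in the post:

import re

# Placeholder input; the original list `st` is not shown in the question.
sample = "(800) 555-1212 ext. 1234"

# Three mandatory groups of 3, 3 and 4 digits, plus an optional 4-digit extension.
pattern = re.compile(r'\D*(\d{3})\D*(\d{3})\D*(\d{4})(?:\D+(\d{4}))?')

match = pattern.search(sample)
if match:
    print(match.groups())  # -> ('800', '555', '1212', '1234')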
I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.
One day I hope to replace my use of SAS with Python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard drive.
My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this:
What are some best-practice workflows for accomplishing the following:
Loading flat files into a permanent, on-disk database structure
Querying that database to retrieve data to feed into a pandas data structure
Updating the database after manipulating pieces in...
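A minimal sketch (file and column names are placeholders) of the kind of HDFStore workflow this describes: append flat-file chunks into an on-disk table, then query back only the rows and columns needed. This relies on pandas' HDFStore, which needs the PyTables package installed:

import numpy as np
import pandas as pd

# Placeholder data standing in for chunks read from large flat files,
# e.g. via pd.read_csv(..., chunksize=...).
chunk = pd.DataFrame({"group": np.random.randint(0, 5, 1000),
                      "value": np.random.randn(1000)})

with pd.HDFStore("store.h5") as store:
    # 'table' format supports appending and on-disk querying.
    store.append("data", chunk, data_columns=["group"])

    # Pull back only the slice needed for analysis.
    subset = store.select("data", where="group == 3", columns=["value"])

print(subset.describe())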
I have created 2 Python sets from 2 different CSV files which contain some strings.
I am trying to match the 2 sets so that it will return an intersection of the 2 (the common strings from both sets should be returned).
This is how my code looks:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import nltk
#using a context manager to open and read the file
#converted the text file into a csv file at the source using Notepad++
with open(r'skills.csv', 'r', encoding="utf-8-sig") as f:
    myskills = f.readlines()

#converting all the strings in the list to lowercase
list_of_myskills = map(lambda x: x.lower(), myskills)
set_of_myskills = set(list_of_myskills)
#print(type(nodup_filtered_content))
print(set_of_myskills)

#open and read line by line from the text file
with open(r'list_of_skills.csv', 'r') as f2:
    #using readlines() instead of read(), because it reads line by line
    #(each line as a string obj in the python list)
    contents_f2 =...
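A minimal sketch of the intersection step itself; the two sets below are placeholders standing in for set_of_myskills and the set built from list_of_skills.csv:

# Placeholder sets standing in for the ones built from the two CSV files.
set_of_myskills = {"python", "sql", "excel"}
set_of_listed_skills = {"sql", "python", "tableau"}

# Either form returns the common strings.
common = set_of_myskills & set_of_listed_skills
# common = set_of_myskills.intersection(set_of_listed_skills)
print(common)  # -> {'python', 'sql'}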
I have 2 grayscale images, say G1 and G2. I also have their statistics (min, max, mean and standard deviation). I would like to change G2 such that the statistics of G2 (min, max, mean and SD) match those of G1. I have tried arithmetic scaling and got the min and max values of G1 and G2 to match, but the mean and SD are still different. I have also tried histogram fitting of G2 to G1, but that did not do what I wanted either. I am using a software package called SPIDER, but this is a question applicable to image processing in general and could be tackled with different software packages (OpenCV, MATLAB, etc.). Any ideas and suggestions would be greatly appreciated.
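As a point of reference, a small NumPy sketch (toy arrays, not the actual images) of the usual linear rescaling that matches mean and standard deviation; note that a single linear transform generally cannot force min, max, mean, and SD to all match at once:

import numpy as np

# Toy stand-ins for the two grayscale images.
g1 = np.random.normal(loc=120.0, scale=30.0, size=(64, 64))
g2 = np.random.normal(loc=80.0, scale=10.0, size=(64, 64))

# Linear transform: shift/scale G2 so its mean and SD equal those of G1.
g2_matched = (g2 - g2.mean()) / g2.std() * g1.std() + g1.mean()

print(g1.mean(), g1.std())
print(g2_matched.mean(), g2_matched.std())  # now (approximately) equal to G1's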
True ... it has been discussed quite a lot.
However, there is a lot of ambiguity and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.
The ambiguous and/or omitted details
The following ambiguous, unclear, and/or omitted details should be clarified for each option:
How ClassPath is affected
Driver
Executor (for tasks running)
Both
not at all
Separation character: comma, colon, semicolon
If provided files are automatically distributed
for the tasks (to each executor)
for the remote Driver (if run in cluster mode)
type of URI accepted: local file, hdfs, http, etc
If copied into a common location, where that location is (hdfs, local?)
The options that it affects:
--jars
SparkContext.addJar(...) method
SparkContext.addFile(...) method
--conf spark.driver.extraClassPath=... or --driver-class-path ...
--conf spark.driver.extraLibraryPath=..., or --driver-library-path ...
--conf spark.executor.extraClassPath=...
--conf...
Related question here.
So I have a character vector with currency values that contain both dollar signs and commas. However, I want to try and remove both the commas and dollar signs in the same step.
This removes dollar signs:
d = c("$0.00", "$10,598.90", "$13,082.47")
gsub('\\$', '', d)
This removes commas:
library(stringr)
str_replace_all(c("10,0", "tat,y"), fixed(","), "")
I'm wondering if I could remove both characters in one step.
I realize that I could just save the gsub results into a new variable, and then reapply that (or another function) on that variable. But I guess I'm wondering about a single step to do both.