Good afternoon.
I have a question I am trying to solve using pandas data structures and related syntax in Python. I have already graduated from a US university and am employed; I am currently taking the Coursera course "Python for Data Science", offered online by the University of Michigan, purely for professional development. I am not sharing answers with anyone, as I abide by Coursera's Honor Code.
First, I was given this pandas DataFrame concerning Olympic medals won by countries around the world, with these columns:
# Summer Gold Silver Bronze Total # Winter Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 Combined total ID
I am working on a problem for an Intro to Data Science course on Coursera, and I am struggling with adding data to a column in a DataFrame.
This is the data set I'm working with:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME
1 50 3 6 1 1 Alabama Autauga County
2 50 3 6 1 3 Alabama Baldwin County
3 50 3 6 1 5 Alabama Barbour County
4 50 3 6 1 7 Alabama Bibb County
What I am trying to do is insert a column called TotalCounties, containing the total count of counties per state, as the last column. I've done similar things in SQL, but it doesn't seem to work quite the same way in Python.
I have tried the code below, but the column ends up displaying as NaN instead of a number like I want it to.
counties_only_df = census_df[census_df['SUMLEV'] == 50]
x = counties_only_df.groupby('STNAME').count()
counties_only_df = x
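For reference, the usual fix for the NaN result is groupby(...).transform('count'), which broadcasts the per-state count back onto every row instead of collapsing the frame the way count() does. A sketch on a made-up miniature of the census frame:

```python
import pandas as pd

# Toy stand-in for census_df; the real file has these columns among others.
census_df = pd.DataFrame({
    'SUMLEV': [40, 50, 50, 50],
    'STNAME': ['Alabama', 'Alabama', 'Alabama', 'Alaska'],
    'CTYNAME': ['Alabama', 'Autauga County', 'Baldwin County',
                'Aleutians East Borough'],
})

# Keep only county rows (SUMLEV == 50), then broadcast the per-state
# county count back onto every row, like a SQL window COUNT.
counties_only_df = census_df[census_df['SUMLEV'] == 50].copy()
counties_only_df['TotalCounties'] = (
    counties_only_df.groupby('STNAME')['CTYNAME'].transform('count')
)
print(counties_only_df[['STNAME', 'CTYNAME', 'TotalCounties']])
```

Because transform returns a result aligned to the original index, the new column lines up row by row instead of producing NaN.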
I often want to quickly save some Python data, but I would also like to save it in a stable file format in case the data lingers for a long time. So my question is: how can I save my data?
In data science, there are three kinds of data I want to store -- arbitrary Python objects, NumPy arrays, and pandas DataFrames. What are the stable ways of storing each of these?
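For each of the three, a commonly recommended stable choice, sketched here as round-trips through a temporary directory (the file names are made up): pickle for arbitrary objects, the documented .npy format for NumPy arrays, and plain-text CSV for DataFrames (Parquet via df.to_parquet is a good binary alternative if pyarrow is installed).

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

tmp = tempfile.mkdtemp()

# 1. Arbitrary Python objects: pickle (Python-specific, but versioned).
obj = {'name': 'run1', 'params': [1, 2, 3]}
with open(os.path.join(tmp, 'obj.pkl'), 'wb') as f:
    pickle.dump(obj, f)

# 2. NumPy arrays: the .npy format has a documented, stable binary layout.
arr = np.arange(6).reshape(2, 3)
np.save(os.path.join(tmp, 'arr.npy'), arr)

# 3. DataFrames: CSV is plain text and readable by almost anything.
df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.to_csv(os.path.join(tmp, 'df.csv'), index=False)

# Round-trip each one to confirm it survives.
with open(os.path.join(tmp, 'obj.pkl'), 'rb') as f:
    obj2 = pickle.load(f)
arr2 = np.load(os.path.join(tmp, 'arr.npy'))
df2 = pd.read_csv(os.path.join(tmp, 'df.csv'))
```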
I want to train a simple neural network with PyTorch on a pandas DataFrame df.
One of the columns is named "Target", and it is the target variable of the network. How can I use this DataFrame as input to the PyTorch network?
I tried this, but it doesn't work:
import pandas as pd
import torch.utils.data as data_utils
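One common approach, sketched on a toy df (the feature column names x1/x2 are made up; only "Target" comes from the question), is to split features from the target, convert both to tensors, and wrap them in a TensorDataset:

```python
import pandas as pd
import torch
import torch.utils.data as data_utils

# Toy stand-in for df: two feature columns plus the "Target" column.
df = pd.DataFrame({'x1': [0.0, 1.0, 2.0, 3.0],
                   'x2': [1.0, 0.0, 1.0, 0.0],
                   'Target': [0, 1, 0, 1]})

# Separate features from the target, then convert to tensors.
features = torch.tensor(df.drop(columns='Target').values,
                        dtype=torch.float32)
target = torch.tensor(df['Target'].values, dtype=torch.long)

# TensorDataset pairs each feature row with its label; DataLoader batches them.
dataset = data_utils.TensorDataset(features, target)
loader = data_utils.DataLoader(dataset, batch_size=2, shuffle=False)

xb, yb = next(iter(loader))
```

The loader can then feed any nn.Module training loop as usual.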
Hi. At university, in the data science area, we learned that if we want to work with small data we should use pandas, and if we work with big data we should use Spark (for Python programmers, PySpark).
Recently, at a hackathon in the cloud (Azure Synapse, which runs on Spark), I saw pandas being imported in the notebook (I assume the code is good, since it was written by Microsoft people):
import pandas
from azureml.core import Dataset
training_pd = training_data.toPandas().to_csv('training_pd.csv', index=False)
I'm learning object-oriented programming in a data science context.
I want to understand what good practice is in terms of writing methods within a class that relate to one another.
When I run my code:
import pandas as pd
pd.options.mode.chained_assignment = None

class MyData:
    def __init__(self, file_path):
        self.file_path = file_path

    def prepper_fun(self):
        '''Reads in an Excel sheet, drops missing values, and casts columns to numeric.'''
        df = pd.read_excel(self.file_path)
        df = df.dropna()
        df = df.apply(pd.to_numeric)
        self.df = df
        return df

    def quality_fun(self):
        '''Checks whether any value in any column is more than 10. If so, that value
        is replaced with the warning 'check original data value'.'''
        for col in self.df.columns:
            for row in self.df.index:
                # Compare the individual cell, not the whole DataFrame,
                # and replace only that cell.
                if self.df.loc[row, col] > 10:
                    self.df.loc[row, col] = 'check original data value'
        return self.df
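As a side note on the loop itself: pandas can do the whole check without nested loops via DataFrame.mask, which replaces every cell matching a boolean condition in one call. A sketch on a toy frame (the column names a/b are made up):

```python
import pandas as pd

# Vectorized version of the quality check: compare and replace whole
# columns at once instead of looping over every cell.
df = pd.DataFrame({'a': [1, 12, 3], 'b': [4, 5, 20]})
flagged = df.mask(df > 10, 'check original data value')
print(flagged)
```

Cells above 10 become the warning string; everything else is left untouched.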
I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.
One day I hope to replace my use of SAS with Python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory yet small enough to fit on a hard drive.
My first thought is to use HDF store to hold large datasets on disk and pull only the pieces I need into DataFrames for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this:
What are some best-practice workflows for accomplishing the following:
Loading flat files into a permanent, on-disk database structure
Querying that database to retrieve data to feed into a pandas data structure
Updating the database after manipulating pieces in pandas
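One way to sketch that three-step loop, using SQLite from the standard library as the on-disk store (HDFStore or MongoDB would slot into the same three places; the table and column names here are made up):

```python
import sqlite3

import pandas as pd

# Use a file path instead of :memory: for a real, permanent on-disk store.
conn = sqlite3.connect(':memory:')

# Step 1: load flat files into the database. In practice you would loop over
# pd.read_csv(path, chunksize=100_000); a literal frame stands in for a chunk.
chunk = pd.DataFrame({'state': ['AL', 'AL', 'AK'], 'pop': [55, 212, 3]})
chunk.to_sql('census', conn, if_exists='append', index=False)

# Step 2: query only the slice needed into a DataFrame.
subset = pd.read_sql("SELECT * FROM census WHERE state = 'AL'", conn)

# Step 3: manipulate the piece in pandas and write the result back.
subset['pop_k'] = subset['pop'] * 1000
subset.to_sql('census_al', conn, if_exists='replace', index=False)
```

Only the queried slice ever has to fit in memory; the full dataset stays on disk.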
This is my problem: in the Coursera course on Applied Data Science with Python, I am doing Assignment 2.
Question 1: Which country has won the most gold medals in summer games? This function should return a single string value.
This is my code:
def answer_one():
    return df[df['Gold'] == df['Gold'].max()].index[0]
answer_one()
This is the error which I am getting:
NameError: name 'df' is not defined
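The NameError itself just means the cell that defines df was never run in the current notebook session, so re-run the cell that loads the data before calling the function. For the question logic, idxmax is the most direct tool: it returns the index label (here, the country) of the largest value. A sketch on a made-up miniature of the medals frame (the column name 'Gold' matches the assignment data; the countries and numbers here are invented):

```python
import pandas as pd

# Tiny stand-in for the medals DataFrame, indexed by country name.
df = pd.DataFrame({'Gold': [10, 27, 8]},
                  index=['Algeria', 'United States', 'Zimbabwe'])

def answer_one():
    # idxmax returns the index label of the row with the largest value.
    return df['Gold'].idxmax()

print(answer_one())
```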
Can anyone tell me what the part (town = thisLine) exactly does?

def get_list_of_university_towns():
    '''Returns a DataFrame of towns and the states they are in from the
    university_towns.txt list. The format of the DataFrame should be:
    DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ],
    columns=["State", "RegionName"] )

    The following cleaning needs to be done:
    1. For "State", removing characters from "[" to the end.
    2. For "RegionName", when applicable, removing every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove newline character '\n'. '''
    data = []
    state = None
    state_towns = []
    with open('university_towns.txt') as file:
        for line in file:
            thisLine = line[:-1]              # drop the trailing newline
            if thisLine[-6:] == '[edit]':     # state lines end with "[edit]"
                state = thisLine[:-6]
                continue
            if '(' in line:
                town = thisLine[:thisLine.index('(') - 1]
                state_towns.append([state, town])
            else:
                town = thisLine
                state_towns.append([state, town])
            data.append(thisLine)
    df = pd.DataFrame(state_towns, columns=['State', 'RegionName'])
    return df
I have lots of Excel files (xlsx format) and want to read and handle them.
For example, the file names are ex201901, ex201902, ..., ex201912.
Each name follows the ex YYYYMM format.
Anyway, importing one of these files into pandas in the usual way is easy.
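To read the whole batch in one go, a glob loop is the usual pattern. A sketch that first writes two tiny stand-in files so it is self-contained (CSV stands in for xlsx here; with the real files, swap read_csv for pd.read_excel, which needs openpyxl installed):

```python
import glob
import os
import tempfile

import pandas as pd

# Create two stand-in monthly files named ex{YYYYMM}.
tmp = tempfile.mkdtemp()
for month in ('201901', '201902'):
    pd.DataFrame({'month': [month], 'sales': [int(month[-2:])]}) \
      .to_csv(os.path.join(tmp, f'ex{month}.csv'), index=False)

# Glob matches every file following the ex YYYYMM pattern;
# sorting keeps the months in chronological order.
frames = []
for path in sorted(glob.glob(os.path.join(tmp, 'ex*.csv'))):
    frames.append(pd.read_csv(path, dtype={'month': str}))

combined = pd.concat(frames, ignore_index=True)
```

With real workbooks, `pd.read_excel(path)` replaces `pd.read_csv(path, ...)` and everything else stays the same.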
I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the DataFrame has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather have one big LabelEncoder object that works across all my columns of data. Throwing the entire DataFrame into LabelEncoder creates the error below. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.
import pandas
from sklearn import preprocessing
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
    y = column_or_1d(y, warn=True)
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in ...
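Since LabelEncoder.fit expects a single 1-D column, one common workaround is to apply a fresh encoder to each column with DataFrame.apply, which never references a column by name. A sketch on dummy data (the column names here are invented):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'pets': ['cat', 'dog', 'cat'],
                   'owner': ['Champ', 'Ron', 'Brick']})

# apply passes each column to the lambda as a 1-D Series,
# so every column gets its own fitted encoder.
encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))
print(encoded)
```

Note that each column's codes are independent; if you later need to invert the encoding, keep one fitted encoder per column (e.g. in a dict) instead of discarding them.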
I want to get a list of the column headers from a pandas DataFrame. The DataFrame will come from user input, so I won't know how many columns there will be or what they will be called.
For example, if I'm given a DataFrame like this:
>>> my_dataframe
y gdp cap
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
5 4 8 3
6 8 2 8
7 9 9 10
8 6 6 4
9 10 10 7
I would get a list like this:
>>> header_list
['y', 'gdp', 'cap']
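A minimal sketch of producing that list (the toy frame mirrors the example above):

```python
import pandas as pd

my_dataframe = pd.DataFrame({'y': [1, 2], 'gdp': [2, 3], 'cap': [5, 9]})

# df.columns is an Index; wrapping it in list() (or .tolist())
# yields a plain Python list of header strings.
header_list = list(my_dataframe.columns)
print(header_list)
```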
I have a data set with a huge number of features, so analysing the correlation matrix has become very difficult. I want to plot the correlation matrix obtained from the dataframe.corr() function of the pandas library. Is there any built-in function provided by pandas to plot this matrix?
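pandas has no dedicated plot for corr(), but the matrix renders directly with matplotlib's matshow (seaborn.heatmap is another common option). A sketch on random data (the column names a-d and the output file name are made up):

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(0).normal(size=(100, 4)),
                  columns=list('abcd'))
corr = df.corr()

# matshow draws the matrix as a colored grid; label both axes
# with the column names and add a colorbar for the scale.
fig, ax = plt.subplots()
im = ax.matshow(corr, vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig(os.path.join(tempfile.mkdtemp(), 'corr.png'))
```

With many features, the color grid makes strong correlations stand out far faster than scanning the numeric matrix.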
I have a machine learning classification problem with 80% categorical variables. Must I use one-hot encoding if I want to use a classifier for the classification? Can I pass the data to a classifier without the encoding?
I am trying to do the following for feature selection:
I read the train file:
num_rows_to_read = 10000
train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)
I change the type of the categorical features to 'category':
non_categorial_features = []  # the list of numeric column names was elided in the original post
for categorical_feature in list(train_small.columns):
    if categorical_feature not in non_categorial_features:
        train_small[categorical_feature] = train_small[categorical_feature].astype('category')
I apply one-hot encoding:
train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
The problem is that the third step often gets stuck, even though I am using a strong machine.
Thus, without the one-hot encoding I can't do any feature selection to determine the importance of the features.
What do you recommend?
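One alternative worth trying before get_dummies: many tree-based classifiers work acceptably with integer category codes, which avoids the memory blow-up of one-hot encoding wide, high-cardinality data. A sketch (the city column is made up):

```python
import pandas as pd

# Toy stand-in for train_small with one categorical column.
train_small = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF']})

# astype('category') stores each distinct value once; .cat.codes then
# yields a compact integer column instead of one dummy column per value.
train_small['city'] = train_small['city'].astype('category')
train_small['city_code'] = train_small['city'].cat.codes
print(train_small)
```

Codes impose an arbitrary ordering, so they suit tree models better than linear ones; for linear models, pd.get_dummies(..., sparse=True) on a subset of columns at a time is the safer route.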