Introduction to Data Science in Python problem

  • Can anyone tell me exactly what this part does: town = thisLine[:thisLine.index('(')-1]?

    import pandas as pd

    def get_list_of_university_towns():
        '''Returns a DataFrame of towns and the states they are in from the 
        university_towns.txt list. The format of the DataFrame should be:
        DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ], 
        columns=["State", "RegionName"]  )
        
        The following cleaning needs to be done:
        1. For "State", removing characters from "[" to the end.
        2. For "RegionName", when applicable, removing every character from " (" to the end.
        3. Depending on how you read the data, you may need to remove newline character '\n'. '''
        
        state = None
        state_towns = []
        with open('university_towns.txt') as file:
            for line in file:
                # Drop the trailing newline, if there is one
                thisLine = line.rstrip('\n')
                if thisLine[-6:] == '[edit]':
                    # State header line: strip the "[edit]" suffix
                    state = thisLine[:-6]
                    continue
                if '(' in thisLine:
                    # Keep the text before " ("; the -1 also drops the space
                    town = thisLine[:thisLine.index('(')-1]
                else:
                    town = thisLine
                state_towns.append([state,town])
        df = pd.DataFrame(state_towns,columns = ['State','RegionName'])
        return df

     

    get_list_of_university_towns()

     
      October 1, 2021 1:32 PM IST
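  • This line implements requirement 2 of the cleanup list: it keeps only the text that comes before " (".

    For example, if the line is:

    line = "Michigan, (Ann Arbor"

    then your code will output "Michigan," because index('(') - 1 is the position of the space before the parenthesis, and the slice stops just before that space.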
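
    The slice can be traced step by step. This is only a minimal sketch; the sample string is a made-up line in the "Town (University)" format and is an assumption, not necessarily a line from university_towns.txt:

```python
# Hypothetical sample line in the "Town (University)" format
this_line = "Ann Arbor (University of Michigan)"

paren_pos = this_line.index('(')   # position of the first '('
town = this_line[:paren_pos - 1]   # stop one character earlier to drop the space too

print(paren_pos)   # 10
print(repr(town))  # 'Ann Arbor'
```

    Without the -1 the result would be 'Ann Arbor ' with a trailing space.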

     
      November 27, 2021 10:35 AM IST
  • It performs this step:

    2. For "RegionName", when applicable, removing every character from " (" to the end.
    

     

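    Note that the -1 in this line is subtracted from the position of '(' so that the space before the parenthesis is dropped as well. It is not a negative index. A negative index of -1, as in thisLine[-1] or line[:-1], counts from the end of a string or list.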
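
    The difference between a negative index and subtracting 1 from a position can be sketched as follows. The sample string is a hypothetical line used only for illustration:

```python
s = "Ann Arbor (University of Michigan)"  # hypothetical sample line

# Negative index: counts from the end of the string
last_char = s[-1]        # the final character
without_last = s[:-1]    # everything except the final character

# Subtraction: an ordinary position, one to the left of '('
cut = s.index('(') - 1   # position of the space before '('
town = s[:cut]           # everything before that space

print(repr(last_char))   # ')'
print(cut)               # 9
print(repr(town))        # 'Ann Arbor'
```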

     
      December 27, 2021 12:14 PM IST
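  • import re
    import pandas as pd
    
    # The with-block closes the file automatically
    with open('university_towns.txt','r') as raw_data:
        data=raw_data.readlines()
    subs='[edit]'
    state=''
    rows=[]
    
    for line in data:
        # rstrip() returns a new string, so the result must be reassigned
        line=line.rstrip()
        if subs in line:
            state=line.replace(subs,'')
        else:
            # Drop everything from " (" to the end of the line
            region=re.sub(r" \(.*",'',line)
            rows.append({'State':state,'RegionName':region})
    
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
    df=pd.DataFrame(rows,columns=['State','RegionName'])
    df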
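
    As a quick check of the regular expression on its own, here is a small sketch using made-up sample lines (both strings are assumptions for illustration, not taken from the file):

```python
import re

# Hypothetical sample lines: one with a parenthesised suffix, one without
samples = ["Ann Arbor (University of Michigan)", "Ypsilanti"]

# Remove everything from " (" to the end; lines without " (" are left unchanged
cleaned = [re.sub(r" \(.*", "", s) for s in samples]

print(cleaned)  # ['Ann Arbor', 'Ypsilanti']
```

    Because re.sub returns the input unchanged when the pattern does not match, this approach needs no separate branch for lines without a parenthesis.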
      October 26, 2021 12:45 PM IST