
Unable to compare 2 Python sets that contain strings

  • I have created 2 Python sets from 2 different CSV files, each of which contains some strings.

    I am trying to match the 2 sets so that I get their intersection (the strings common to both sets should be returned).

    This is how my code looks:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    import string
    import nltk
    #using a context manager to open and read the file
    #converted the text file into a csv file at the source using Notepad++
    with open(r'skills.csv', 'r', encoding="utf-8-sig") as f:
        myskills = f.readlines()
        #converting all the strings in the list to lowercase
        list_of_myskills = map(lambda x: x.lower(), myskills)
        set_of_myskills = set(list_of_myskills)
        #print(type(nodup_filtered_content))
    print(set_of_myskills)
    #open and read by line from the text file
    with open(r'list_of_skills.csv', 'r') as f2:
        #using readlines() instead of read(), because it reads line by line
        #(each line as a string object in the Python list)
        contents_f2 = f2.readlines()
        #converting all the strings in the list to lowercase
        list_of_skills = map(lambda x: x.lower(), contents_f2)
        #converting into sets
        set_of_skills = set(list_of_skills)
    print(set_of_skills)

     

    And this is the function that I am using:

    def set_compare(set1, set2):
        if set1 & set2:
            print('The matching skills are:', set1 & set2)
        else:
            print("No matching skills")

     

    After I run the code:

        set_compare(set_of_skills,set_of_myskills)
    

     

    Output:

    No matching skills
    

    The content of 'skills.csv' is:

    {'critical thinking,identify user needs,business intelligence,business analysis,teamwork,database,data visualization,data analysis,relational database,mysql,oracle sql,design,entity-relationship,develop ,use-cases ,scenarios,project development ,user requirement,design,sequence diagram,state diagram,identifying,uml diagrams,html5,css3,php,clean,analyze,plot,data,python,pandas,numpy,matplotlib,ipython notebook,spyder,anaconda,jupyterlab,data analysis,data visualization,tableau,database,surveys,prototyping,logical data models,data models,requirement elicitation.,leadreship,mysq,team,prioratization,analyze,articulate,'}
    


    Content of the file 'list_of_skills.csv':

    {'assign passwords and maintain database access,agile development,agile project methodology,amazon web services (aws),analytics,analytical,analyze and recommend database improvements,analyze impact of database changes to the business,audit database access and requests,apis,application and server monitoring tools,applications,application development,attention to detail,architecture,big data,business analytics,business intelligence,business process modeling,cloud applications,cloud based visualizations,cloud hosting services,cloud maintenance tasks,cloud management tools,cloud platforms,cloud scalability,cloud services,cloud systems administration,code,coding,computer,communication,configure database software,configuration,configuration management,content strategy,content management,continually review processes for improvement ,continuous deployment,continuous integration,critical thinking,customer support,database,data analysis,data analytics,data imports,data imports,data intelligence,data mining,data modeling,data science,data strategy,data storage,data visualization tools,data visualizations,database administration,deploying applications in a cloud environment,deployment automation tools,deployment of cloud services,design,desktop support,design,design and build database management system,design principles,design prototypes,design specifications,design tools,develop and secure network structures,develop and test methods to synchronize data ,developer,development,documentation,emerging technologies,file systems,flexibility,front end design,google analytics,hardware,help desk,identify user needs ,implement backup and recovery plan ,implementation,information architecture,information design,information systems,interaction design,interaction flows,"install, maintain, and merge databases ",installation,integrated technologies,integrating security protocols with cloud design,internet,it optimization,it security,it soft skills,it solutions,it 
support,languages,logical thinking,leadership,linux,management,messaging,methodology,metrics,microsoft office,migrating existing workloads into cloud systems,mobile applications,motivation,networks,network operations,networking,open source technology integration,operating systems,operations,optimize queries on live data,optimizing user experiences,optimizing website performance,organization,presentation,programming,problem solving,process flows,product design,product development,prototyping methods,product development,product management,product support,product training,project management,repairs,reporting,research emerging technology,responsive design,review existing solutions,search engine optimization (seo),security,self motivated,self starting,servers,software,software development,software engineering,software quality assurance (qa),solid project management capabilities ,solid understanding of company’s data needs ,storage,strong technical and interpersonal communication ,support,systems software,tablets,team building,team oriented,teamwork,technology,tech skills,technical support,technical writing,testing,time management,tools,touch input navigation,training,troubleshooting,troubleshooting break-fix scenarios,user research,user testing,usability,user-centered design,user experience,user flows,user interface,user interaction diagrams,user research,user testing,ui / ux,utilizing cloud automation tools,virtualization,visual design,web analytics,web applications,web development,web design,web technologies,wireframes,work independently,'}
    

    Although I can see the matching keywords by eye, I don't understand why I am not getting the expected output.

    I am not getting any errors either.

     
      December 4, 2021 1:32 PM IST
  • This works: change the .csv files so that each contains the skills separated by ",", with one line per file.

    import pandas as pd
    # header=None: treat the first (and only) line as data, not as column names
    myskills = pd.read_csv("skills.csv", header=None)
    set_of_my_skills = set(myskills.iloc[0])
    list_of_skills = pd.read_csv("list_of_skills.csv", header=None)
    set_of_skills = set(list_of_skills.iloc[0])
    print(set_of_my_skills & set_of_skills)
    
    {'business intelligence', 'design', 'critical thinking', 'data analysis', 'database', 'teamwork'}
    
    skills.csv : critical thinking,identify user needs,business intelligence,business analysis,teamwork,database,data visualization,data analysis,relational database,mysql,oracle sql,design,entity-relationship,develop ,use-cases ,scenarios,project development ,user requirement,design,sequence diagram,state diagram,identifying,uml diagrams,html5,css3,php,clean,analyze,plot,data,python,pandas,numpy,matplotlib,ipython notebook,spyder,anaconda,jupyterlab,data analysis,data visualization,tableau,database,surveys,prototyping,logical data models,data models,requirement elicitation.,leadreship,mysq,team,prioratization,analyze,articulate         
    list_of_skills.csv: assign passwords and maintain database access,agile development,agile project methodology,amazon web services (aws),analytics,analytical,analyze and recommend database improvements,analyze impact of database changes to the business,audit database access and requests,apis,application and server monitoring tools,applications,application development,attention to detail,architecture,big data,business analytics,business intelligence,business process modeling,cloud applications,cloud based visualizations,cloud hosting services,cloud maintenance tasks,cloud management tools,cloud platforms,cloud scalability,cloud services,cloud systems administration,code,coding,computer,communication,configure database software,configuration,configuration management,content strategy,content management,continually review processes for improvement ,continuous deployment,continuous integration,critical thinking,customer support,database,data analysis,data analytics,data imports,data imports,data intelligence,data mining,data modeling,data science,data strategy,data storage,data visualization tools,data visualizations,database administration,deploying applications in a cloud environment,deployment automation tools,deployment of cloud services,design,desktop support,design,design and build database management system,design principles,design prototypes,design specifications,design tools,develop and secure network structures,develop and test methods to synchronize data ,developer,development,documentation,emerging technologies,file systems,flexibility,front end design,google analytics,hardware,help desk,identify user needs ,implement backup and recovery plan ,implementation,information architecture,information design,information systems,interaction design,interaction flows,"install, maintain, and merge databases ",installation,integrated technologies,integrating security protocols with cloud design,internet,it optimization,it security,it soft skills,it solutions,it 
support,languages,logical thinking,leadership,linux,management,messaging,methodology,metrics,microsoft office,migrating existing workloads into cloud systems,mobile applications,motivation,networks,network operations,networking,open source technology integration,operating systems,operations,optimize queries on live data,optimizing user experiences,optimizing website performance,organization,presentation,programming,problem solving,process flows,product design,product development,prototyping methods,product development,product management,product support,product training,project management,repairs,reporting,research emerging technology,responsive design,review existing solutions,search engine optimization (seo),security,self motivated,self starting,servers,software,software development,software engineering,software quality assurance (qa),solid project management capabilities ,solid understanding of company’s data needs ,storage,strong technical and interpersonal communication ,support,systems software,tablets,team building,team oriented,teamwork,technology,tech skills,technical support,technical writing,testing,time management,tools,touch input navigation,training,troubleshooting,troubleshooting break-fix scenarios,user research,user testing,usability,user-centered design,user experience,user flows,user interface,user interaction diagrams,user research,user testing,ui / ux,utilizing cloud automation tools,virtualization,visual design,web analytics,web applications,web development,web design,web technologies,wireframes,work independently
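    A note on the output above: some entries in the files carry trailing spaces (for example 'identify user needs ' in list_of_skills.csv), so they do not match their counterparts in the other file. Below is a minimal sketch, assuming you want such entries to count as matches. The helper name normalize and the sample values are made up for illustration and stand in for the rows read from the two files.

```python
def normalize(values):
    """Strip surrounding whitespace, lowercase, and drop empty or non-string cells."""
    return {v.strip().lower() for v in values if isinstance(v, str) and v.strip()}

# Hypothetical sample rows standing in for the single row of each CSV file.
my_row = ["critical thinking", "identify user needs", "develop ", ""]
all_row = ["critical thinking", "identify user needs ", "developer"]

# Raw comparison: the trailing space keeps 'identify user needs' from matching.
raw_common = set(my_row) & set(all_row)

# Normalized comparison: whitespace differences no longer matter.
clean_common = normalize(my_row) & normalize(all_row)

print(sorted(raw_common))
print(sorted(clean_common))
```

    The isinstance check is there because empty cells read by pandas typically come back as NaN rather than as strings.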
      December 7, 2021 1:45 PM IST
  • Comparing two sets of strings does not compare substrings of those strings. What your program is essentially doing is

    foo = {'ABC', 'DEF', 'GHI'}
    bar = {'AB', 'CD', 'DE', 'FG', 'HI'}
    foo.intersection(bar)  # returns set()

    Just because characters are shared between strings in different sets does not mean the sets have an intersection. The string 'ABC' is in the first set but not the second, the string 'AB' is in the second but not the first, and so on.
    It's a bit unclear what exactly you want to intersect between the two CSVs. Do you want to find the individual cells that are in both? Do they have to match in columns as well? If you provide some more information about the expected output, then I can edit this answer to provide more information.
    [Edit] Per your comment, it looks like what you want is to split those giant strings on commas so the elements of the sets become individual cells. Currently, those sets each only have one element, each of which is just a single giant string with lots of skills in it. If you replace
    list_of_myskills = map(lambda x: x.lower(), myskills)
    with
    list_of_myskills = [y.strip().lower() for x in myskills for y in x.split(',')]
    and replace the other similar line accordingly, then you will likely be closer to what you're expecting.
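    Put together, the idea can be sketched as follows. This is only an illustration under some assumptions: it uses the standard csv module instead of a plain split(',') so that quoted cells such as "install, maintain, and merge databases " stay in one piece, it drops the empty strings produced by trailing commas, and it reads from small in-memory stand-ins rather than your real files. The helper name load_skills and the sample data are hypothetical.

```python
import csv
import io

def load_skills(f):
    """Read every cell from a CSV file object into a set of cleaned strings."""
    skills = set()
    for row in csv.reader(f):
        for cell in row:
            cell = cell.strip().lower()
            if cell:  # skip empty cells left by trailing commas
                skills.add(cell)
    return skills

# Small in-memory stand-ins for the two files (hypothetical sample data).
my_file = io.StringIO('Critical Thinking,database,python,\n')
all_file = io.StringIO(
    'critical thinking,"install, maintain, and merge databases ",database,\n'
)

set_of_myskills = load_skills(my_file)
set_of_skills = load_skills(all_file)

common = set_of_myskills & set_of_skills
print(sorted(common))  # ['critical thinking', 'database']
```

    With real files you would pass the object returned by open(...) in place of the io.StringIO stand-ins.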
      December 7, 2021 5:28 PM IST
  • #String compare in Python example
    str_x = 'Hello & Welcome'
    str_y = 'Hello & Welcome'

    #comparing by ==
    if str_x == str_y:
        print("Same Strings")
    else:
        print("Different Strings")
      December 14, 2021 11:53 AM IST
  • You can sort both:

    sorted(a) == sorted(b)
    

     

    Counting the elements with collections.Counter could also be more efficient (but it requires the elements to be hashable).

    >>> from collections import Counter
    >>> a = [1, 2, 3, 1, 2, 3]
    >>> b = [3, 2, 1, 3, 2, 1]
    >>> print (Counter(a) == Counter(b))
    True
      December 15, 2021 12:41 PM IST