How to extract a particular field from a text file

  • I am trying to extract the Experience field from a text file. After converting the PDF to a text file, a few extra blank lines appear, and because of them I am not able to extract the data properly. The text produced by the conversion is shown below. Can someone please tell me how to extract the Experience field from this file?

    The code below works fine for text files that contain no blank lines.

    with open('E:/cvparser/sampath.txt', 'r', encoding = 'utf-8') as f:
        exp_summary_flag = False
        exp_summary = ''
        for line in f:
            if line.startswith('EXPERIENCE'):
                exp_summary_flag = True
            elif exp_summary_flag:
                exp_summary += line
                if not line.strip(): break
    
    print(exp_summary)

     

    Here is the text file I got after converting the PDF with pdfminer.

    Sampath XYZ 
    
    8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota 
    
    +91 654876352 | ABCDEFG@gmail.com | 7/108, Malviya Nagar Jaipur (302017) 
    
    SUMMARY 
    
    
    
    To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an 
    innovative software company. 
    
    
    
    EXPERIENCE 
    
    
    
    
    
      Machine Learning Engineering Intern , Forsk Technologies , Jaipur  (May,2017 – July,2017)     
    
    Learned the foundational concepts of data science and machine learning including python and statistics, 
    enough time was spent on understanding the concept behind each algorithm and examples and case 
    studies were done. Built some mid-scaled machine learning models using supervised and unsupervised 
    learning. 
    
      Software Engineering Intern , Proxbotics Creations Technologies , Jaipur (May,2016 – July,2016) 
    
    Developed  and  optimized  various  projects  including  ecommerce,  booking  &  reservation,  non-profit 
    organization Websites, using technologies: HTML, CSS, PHP, JavaScript, MySQL etc.                          
    
      Trainee at TecheduSoft , Kota  (May,2015) 
    
    The course contains 15+ modules including Android Basics, fragments, screen designing, intents, various 
    views, signing app, web servers, web services, notifications, etc.                                                       
    
    PROJECTS 
    
    All projects are available on git: https://github.com/JAIJANYANI 
    
      Video Analysis for surveillance  
    
    -A command line app which takes all your CCTV feeds as input and filters feeds with abnormal events 
    which results in 90% less videos to watch, Used image processing and deep learning algorithms, 
    outputs all time-stamps of interesting events for all feeds. 
    
      Food Calorie Estimator 
    
    -An android app to estimate calories present in food with still image. Trained own Data-set (Meal-net) 
    using Transfer learning Built upon Inception V3, Proposed a Deep Convolutional Neural Network (CNN) 
    with 48 Layers, Developed a REST API to integrate it in Mobile apps, Optimized total computation time 
    ~ 2 Seconds. 
    
      CryptoCurrency Market Predictor 
    
    - A Flask app to predict the future prices of various Crypto Currencies, implemented various supervised 
    and deep learning algorithms such as LSTM (RNN), polynomial regression, using scikit-learn, tensorflow, 
    keras etc.  
    
      Spam Filter 
    
    -A REST API to Detect Incoming SMS or Email as Spam or Ham which can be trained on your own data 
    set. Used NLP with Naive Bayes for Sentiment Analysis. 
    
    
     
    
    Image Classifier using CNN 
    -An application which detects objects present in a still image, implemented convolutional neural 
    network using open source machine learning library which can be run on multiple machines to reduce 
    training workloads, classifies objects using pre-trained image-net model. 
    
      Online Student and Faculty Portal 
    
    -A Web Portal to manage attendance of students and faculties, can be integrated to mobile apps. Uses 
    Php, MySQL, HTML, CSS, JavaScript, etc. 
    
      Tax Accounting 
    
    -A Decentralized web app built on Ethereum Block-Chain using Truffle and Embark framework, which 
    can be used to transfer funds between accounts which automatically deducts tax from the account. 
    
    
    
    TECHNICAL SKILLS 
    
    Programming Languages 
    
    Web Technologies  
    
    
    
    Scripting Languages     
    
    
    
    
    
    
    
    Database Management System  
    
    
    
    Operating Systems  
    
    Strongest Areas 
    
    
    
    COURSES 
    
    
    
    
    
    
    
    : 
    
    : 
    
    : 
    
    : 
    
    : 
    
    C, C++ 
    
    HTML, CSS 
    
    Python, PHP, BASH 
    
    MySQL, SQLite 
    
    Microsoft Windows, Linux, UNIX 
    
                 :  
    
    Machine Learning, Data Science 
    
    Applied  Machine  Learning  ,  Applied  Data  Science  ,  Exploratory  Data  Analysis  &  Data  Visualization  ,  Neural 
    Networks & Deep Learning , Computer networks , Data Structures & Algorithms , Operating Systems , Cloud 
    Computing , Data Mining , Block chain Essentials , Database Management Systems. 
    
    
    
    EDUCATION 
    
      University College of Engineering , Kota : Btech (Pursuing) in Computer Science Engineering  (2018) 
      St. Edmunds School , Jaipur : Senior Secondary (XII) Education Rajasthan  (2012) 
      St. Edmunds School , Jaipur : Secondary (X) Education Rajasthan  (2010) 

     

    How can I extract experience from this text file?

     
      September 1, 2021 2:05 PM IST
  • Your code does not work when there are blank lines between EXPERIENCE and the rest of the content, because "if not line.strip(): break" exits the loop at the first blank line. You need a specific identifier that marks the end of the section, and you should exit the loop only when you reach it.
    Something like the code below may work. I tried it on my own CV to extract the experience summary, using 'Technical Expertise' as the end point.

    from docx import Document

    document = Document(r'cv.docx')
    exp_summary_flag = False
    exp_summary = ''
    for p in document.paragraphs:
        if p.text == 'Experience Summary':
            exp_summary_flag = True    # start printing after this heading
        elif p.text == 'Technical Expertise':
            break                      # end point: the next section heading
        elif exp_summary_flag:
            print(p.text)

    For a more generic solution, it would be better to convert the file to XML and read the specific tag, so that you don't need an end-point identifier.
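    The end-point idea above can also be applied to the plain-text file from the question. The following is a minimal sketch, not a definitive implementation: the function name extract_section and the rule that a heading is a line made up only of capital letters and spaces are assumptions based on the pdfminer output shown in the question (SUMMARY, EXPERIENCE, PROJECTS, ...). Blank lines are skipped instead of ending the loop.

```python
import re

# Assumption: a section heading is a line consisting only of capital letters
# and spaces, as in the converted CV (SUMMARY, EXPERIENCE, PROJECTS, ...).
HEADING = re.compile(r'[A-Z][A-Z ]+')


def extract_section(lines, start='EXPERIENCE'):
    """Return the non-blank lines between `start` and the next heading."""
    collected = []
    inside = False
    for line in lines:
        text = line.strip()
        if not inside:
            # Keep scanning until the start heading is found.
            inside = (text == start)
        elif HEADING.fullmatch(text):
            break  # the next heading (e.g. PROJECTS) is the end point
        elif text:
            collected.append(text)  # blank lines are skipped, not fatal
    return '\n'.join(collected)
```

    Because the function only iterates over lines, the open file object from the question can be passed to it directly, for example extract_section(f). A CV whose headings are not written in capitals would need a different heading rule.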
      September 3, 2021 5:32 PM IST
  • It seems you want to extract data from a CV. This is a complex problem, and a complete answer would be too long to give here, but here are some hints that might help.

    First of all, convert the PDF to JSON or XML rather than to plain text. These formats provide more information, such as the position on the page of a word, paragraph or sequence of words, the font, and so on. Try to use this information to extract the data you want. For example, a subtitle like "Experience" will probably use a font different from the one used for the paragraphs, so you can use the font name or size to pick out subtitles. Subtitles sometimes have a special background color too, which you can use as well.

    You can also use the most common font (that is, the font with the most occurrences in the text) together with the position on the page to extract paragraphs. Note that every word or sequence of words in the JSON or XML has attributes (x, y, height, width) that can be used to detect line spacing, tabulation, text columns, etc.
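    The font-based idea above can be sketched as follows. This is only an illustration under stated assumptions: the spans list of (text, font_name, font_size) tuples is hypothetical and stands in for whatever the converter's XML or JSON output provides, and split_by_font is an illustrative name. The most common font is treated as body text, and a span in any other font is treated as a subtitle.

```python
from collections import Counter


def split_by_font(spans):
    """Group body text under subtitles, using the font to tell them apart.

    `spans` is a hypothetical list of (text, font_name, font_size) tuples in
    reading order, e.g. built from a PDF converter's XML or JSON output.
    """
    # The most frequent (font name, size) pair is assumed to be the body font.
    fonts = Counter((name, size) for _, name, size in spans)
    body_font = fonts.most_common(1)[0][0]

    sections = {}
    current = None
    for text, name, size in spans:
        if (name, size) != body_font:
            # Any other font is treated as a subtitle that opens a new section.
            current = text.strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(text.strip())
    return sections
```

    With such a mapping, the experience text would be available as sections['EXPERIENCE']. Real CVs often mix several fonts, so the position attributes mentioned above would probably still be needed to refine the result.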

    I hope this is useful.

      September 4, 2021 12:46 PM IST
  • I tried the awk method below and it worked fine:

    # Check every whitespace-separated field on each line and print the ones that match.
    # A single awk pass also handles files with more than one line.
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /EventCorrelationId|CreationTime|SubscriberNumber/) print $i }' filename

     

    Output:

    EventCorrelationId="615-493|-1899671563||1550927718000"
    CreationTime="20190225094504"
    SubscriberNumber=9270507336
      October 8, 2021 1:16 PM IST