How to extract particular field from text field


Tag : python-3.x

Vaibhav Mali

259

I am trying to extract the Experience field from a text file. After converting the PDF to text, a few extra blank lines appear, and because of them I cannot extract the data properly. The text produced by the conversion is shown below. Can someone please tell me how to extract the Experience field from this file?

The code below works perfectly for text files that contain no blank lines.

with open('E:/cvparser/sampath.txt', 'r', encoding = 'utf-8') as f:
    exp_summary_flag = False
    exp_summary = ''
    for line in f:
        if line.startswith('EXPERIENCE'):
            exp_summary_flag = True
        elif exp_summary_flag:
            exp_summary += line
            if not line.strip(): break

print(exp_summary)

Here is the text file I got after conversion using pdfminer.

Sampath XYZ 

8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota 

+91 654876352 | ABCDEFG@gmail.com | 7/108, Malviya Nagar Jaipur (302017) 

SUMMARY 



To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an 
innovative software company. 



EXPERIENCE 





  Machine Learning Engineering Intern , Forsk Technologies , Jaipur  (May,2017 – July,2017)     

Learned the foundational concepts of data science and machine learning including python and statistics, 
enough time was spent on understanding the concept behind each algorithm and examples and case 
studies were done. Built some mid-scaled machine learning models using supervised and unsupervised 
learning. 

  Software Engineering Intern , Proxbotics Creations Technologies , Jaipur (May,2016 – July,2016) 

Developed  and  optimized  various  projects  including  ecommerce,  booking  &  reservation,  non-profit 
organization Websites, using technologies: HTML, CSS, PHP, JavaScript, MySQL etc.                          

  Trainee at TecheduSoft , Kota  (May,2015) 

The course contains 15+ modules including Android Basics, fragments, screen designing, intents, various 
views, signing app, web servers, web services, notifications, etc.                                                       

PROJECTS 

All projects are available on git: https://github.com/JAIJANYANI 

  Video Analysis for surveillance  

-A command line app which takes all your CCTV feeds as input and filters feeds with abnormal events 
which results in 90% less videos to watch, Used image processing and deep learning algorithms, 
outputs all time-stamps of interesting events for all feeds. 

  Food Calorie Estimator 

-An android app to estimate calories present in food with still image. Trained own Data-set (Meal-net) 
using Transfer learning Built upon Inception V3, Proposed a Deep Convolutional Neural Network (CNN) 
with 48 Layers, Developed a REST API to integrate it in Mobile apps, Optimized total computation time 
~ 2 Seconds. 

  CryptoCurrency Market Predictor 

- A Flask app to predict the future prices of various Crypto Currencies, implemented various supervised 
and deep learning algorithms such as LSTM (RNN), polynomial regression, using scikit-learn, tensorflow, 
keras etc.  

  Spam Filter 

-A REST API to Detect Incoming SMS or Email as Spam or Ham which can be trained on your own data 
set. Used NLP with Naive Bayes for Sentiment Analysis. 


 

Image Classifier using CNN 
-An application which detects objects present in a still image, implemented convolutional neural 
network using open source machine learning library which can be run on multiple machines to reduce 
training workloads, classifies objects using pre-trained image-net model. 

  Online Student and Faculty Portal 

-A Web Portal to manage attendance of students and faculties, can be integrated to mobile apps. Uses 
Php, MySQL, HTML, CSS, JavaScript, etc. 

  Tax Accounting 

-A Decentralized web app built on Ethereum Block-Chain using Truffle and Embark framework, which 
can be used to transfer funds between accounts which automatically deducts tax from the account. 



TECHNICAL SKILLS 

Programming Languages 

Web Technologies  



Scripting Languages     







Database Management System  



Operating Systems  

Strongest Areas 



COURSES 







: 

: 

: 

: 

: 

C, C++ 

HTML, CSS 

Python, PHP, BASH 

MySQL, SQLite 

Microsoft Windows, Linux, UNIX 

             :  

Machine Learning, Data Science 

Applied  Machine  Learning  ,  Applied  Data  Science  ,  Exploratory  Data  Analysis  &  Data  Visualization  ,  Neural 
Networks & Deep Learning , Computer networks , Data Structures & Algorithms , Operating Systems , Cloud 
Computing , Data Mining , Block chain Essentials , Database Management Systems. 



EDUCATION 

  University College of Engineering , Kota : Btech (Pursuing) in Computer Science Engineering  (2018) 
  St. Edmunds School , Jaipur : Senior Secondary (XII) Education Rajasthan  (2012) 
  St. Edmunds School , Jaipur : Secondary (X) Education Rajasthan  (2010)

How can I extract experience from this text file?

September 1, 2021 2:05 PM IST

Sindhuja Martha

181

Your code does not work when there are blank lines between EXPERIENCE and the rest of the content, because "if not line.strip(): break" exits the loop at the first blank line. You need a specific identifier that tells the loop where to stop.

Maybe something like the code below. I tried it on my personal CV to extract the experience summary, using 'Technical Expertise' as the end point.

from docx import Document

document = Document(r'cv.docx')
exp_summary_flag = False
exp_summary = ''
for p in document.paragraphs:
    if p.text == 'Experience Summary':
        exp_summary_flag = True
    elif p.text == 'Technical Expertise':
        break
    elif exp_summary_flag:
        print(p.text)
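The same end-marker idea can be applied to the plain-text file from the question. The following is only a minimal sketch: it assumes that section headings are short lines written entirely in capitals (as in the sample CV), and the names `extract_section` and `HEADING` are illustrative, not part of the original code. Blank lines are skipped instead of ending the loop, and the next heading acts as the end point.

```python
import re

# Assumption: a heading is a line made only of capital letters and spaces,
# like SUMMARY, EXPERIENCE, PROJECTS or TECHNICAL SKILLS in the sample CV.
HEADING = re.compile(r"^[A-Z][A-Z ]+$")

def extract_section(lines, start="EXPERIENCE"):
    """Return the non-blank lines between `start` and the next heading."""
    collected = []
    in_section = False
    for line in lines:
        text = line.strip()
        if not in_section:
            # Keep scanning until the start heading is found.
            in_section = (text == start)
            continue
        if not text:
            continue  # skip blank lines instead of breaking on them
        if HEADING.match(text):
            break     # the next heading ends the section
        collected.append(text)
    return "\n".join(collected)

# Shortened sample in the shape of the converted CV.
sample = """SUMMARY

To seek an opportunity.

EXPERIENCE



  Machine Learning Engineering Intern , Forsk Technologies , Jaipur

Learned the foundational concepts of data science.

PROJECTS

  Spam Filter
"""

experience = extract_section(sample.splitlines())
print(experience)
```

To run it on the real file, pass the open file object instead of `sample.splitlines()`, since the function only needs an iterable of lines.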

References : Reading .docx files in Python to find strikethrough, bullets and other formats

For a more generic solution, it is better to convert the document into XML and read the specific tag, so that you don't need an end-point identifier.

References : Extracting specific xml tag value using python https://www.tutorialspoint.com/How-to-get-specific-nodes-in-xml-file-in-Python
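As a rough illustration of reading one tag from XML, here is a sketch using the standard library's `xml.etree.ElementTree`. The XML layout (a `<section>` element with a `name` attribute and `<item>` children) is a made-up assumption; a real PDF-to-XML converter will produce a different structure, so the path expression would need adapting.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: one <section> element per CV heading.
cv_xml = """
<cv>
  <section name="SUMMARY"><item>Seeking an opportunity.</item></section>
  <section name="EXPERIENCE">
    <item>Machine Learning Engineering Intern, Forsk Technologies</item>
    <item>Software Engineering Intern, Proxbotics Creations Technologies</item>
  </section>
  <section name="PROJECTS"><item>Spam Filter</item></section>
</cv>
""".strip()

root = ET.fromstring(cv_xml)

# Select the section by attribute value; no end-point marker is needed
# because the element's closing tag already delimits the content.
section = root.find("./section[@name='EXPERIENCE']")
experience_items = [item.text.strip() for item in section.findall("item")]
print(experience_items)
```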

September 3, 2021 5:32 PM IST

0
Samar Patil

346 3

It seems you want to extract data from a CV. This is a complex problem, and a complete answer would be too long to give here, but here are some hints that might help.

First of all, you should convert the PDF into JSON or XML rather than plain text. These formats provide more information, such as the position on the page of a word, paragraph or sequence of words, the font, etc. Try to use this information to extract the data you want. For example, a subtitle like "Experience" will most probably use a special font, different from the font used for the paragraphs, so you might use the font name/size to extract subtitles. Sometimes subtitles have a special background color too, which you might use as well.

You may also use the most common font (that is, the font with the most occurrences in the text) and the position on the page to extract paragraphs. Note that every word or sequence of words in the JSON or XML has attributes (x, y, height, width) that can be used to detect line spacing, tabulation, text columns, etc.

I hope this is useful.
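The font-based idea could look roughly like the sketch below. It assumes the conversion step has already produced (text, font size) pairs for each line; the sample values and the function name `split_by_headings` are hypothetical. Any line set in a larger size than the body text is treated as a subtitle.

```python
from collections import Counter

# Hypothetical (text, font_size) pairs, as they might come out of a
# layout-aware PDF conversion; the sizes are invented for illustration.
lines = [
    ("SUMMARY", 14.0),
    ("To seek an opportunity in an innovative software company.", 10.0),
    ("EXPERIENCE", 14.0),
    ("Machine Learning Engineering Intern, Forsk Technologies", 10.0),
    ("Learned the foundational concepts of data science.", 10.0),
    ("PROJECTS", 14.0),
    ("Spam Filter", 10.0),
]

def split_by_headings(lines):
    """Map each heading to the body lines that follow it."""
    # Assumption: the most frequent size is the body text size.
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    sections = {}
    current = None
    for text, size in lines:
        if size > body_size:
            current = text          # larger font: start a new section
            sections[current] = []
        elif current is not None:
            sections[current].append(text)
    return sections

sections = split_by_headings(lines)
print(sections["EXPERIENCE"])
```

In a real CV the comparison might need to use the font name or weight as well, since some subtitles are bold rather than larger.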
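To make the position-based hint concrete, here is a small sketch that groups words into lines by their y coordinate and then starts a new paragraph wherever the vertical gap is noticeably larger than the line height. The `Word` class, the coordinates, the `gap_factor` threshold, and the convention that y grows downwards are all assumptions for illustration; real converter output would first need to be mapped into this shape.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float
    y: float       # assumed: distance from the top of the page
    height: float
    width: float

def group_paragraphs(words, gap_factor=1.5):
    """Group words into lines by y, then lines into paragraphs by gap."""
    # Assumption: words on the same line share exactly the same y value.
    lines = {}
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        lines.setdefault(w.y, []).append(w)

    paragraphs, current = [], []
    prev_y = prev_height = None
    for y, line_words in sorted(lines.items()):
        # A gap clearly larger than the previous line height is treated
        # as a paragraph break.
        if prev_y is not None and y - prev_y > gap_factor * prev_height:
            paragraphs.append(" ".join(current))
            current = []
        current.append(" ".join(w.text for w in line_words))
        prev_y, prev_height = y, line_words[0].height
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

# Hypothetical word boxes: two close lines, then a larger gap.
words = [
    Word("Learned", 50, 100, 10, 40), Word("concepts", 95, 100, 10, 45),
    Word("of", 50, 112, 10, 10), Word("statistics.", 65, 112, 10, 50),
    Word("Developed", 50, 140, 10, 50), Word("websites.", 105, 140, 10, 45),
]
paragraphs = group_paragraphs(words)
print(paragraphs)
```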

September 4, 2021 12:46 PM IST

0

Maryam Bains

317

I tried the awk method below and it worked fine.

j=`awk '{print NF}' filename `
for ((i=1;i<=$j;i++)); do awk -v i="$i" '$i ~ /EventCorrelationId/||$i ~ /CreationTime/||$i ~ /SubscriberNumber/{print $i}' filename ; done

output

EventCorrelationId="615-493|-1899671563||1550927718000"
CreationTime="20190225094504"
SubscriberNumber=9270507336
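Since the thread is tagged python-3.x, a rough Python counterpart of the awk approach may be useful. This is only a sketch: the input line is hypothetical (the post shows just the output), and unlike the awk patterns, which match anywhere in a field, it compares the key before `=` exactly.

```python
# Hypothetical input line; the original post shows only the output.
record = (
    'EventCorrelationId="615-493|-1899671563||1550927718000" Status=OK '
    'CreationTime="20190225094504" SubscriberNumber=9270507336'
)

WANTED = ("EventCorrelationId", "CreationTime", "SubscriberNumber")

# Keep only the whitespace-separated key=value tokens whose key is wanted.
fields = [tok for tok in record.split() if tok.split("=", 1)[0] in WANTED]
print("\n".join(fields))
```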

October 8, 2021 1:16 PM IST
