How can I extract experience from this text file?
I am trying to extract the Experience field from the text. But after converting the PDF to a text file, a few extra blank lines appear, because of which I am not able to extract the data properly. Below is the text yielded after the conversion. Can someone please tell me how to extract the Experience field from this file?
The code below works perfectly for text files that contain no blank lines.
with open('E:/cvparser/sampath.txt', 'r', encoding='utf-8') as f:
    exp_summary_flag = False
    exp_summary = ''
    for line in f:
        if line.startswith('EXPERIENCE'):
            exp_summary_flag = True
        elif exp_summary_flag:
            exp_summary += line
            if not line.strip():
                break
print(exp_summary)
Here is the text file which I got after conversion using pdfminer.
Sampath XYZ
8th Semester Undergraduate | Computer Science Engineering | UCE RTU, Kota
+91 654876352 | ABCDEFG@gmail.com | 7/108, Malviya Nagar Jaipur (302017)
SUMMARY
To seek an opportunity to apply my technology expertise along with my creative problem solving skills in an
innovative software company.
EXPERIENCE
Machine Learning Engineering Intern , Forsk Technologies , Jaipur (May,2017 – July,2017)
Learned the foundational concepts of data science and machine learning including python and statistics,
enough time was spent on understanding the concept behind each algorithm and examples and case
studies were done. Built some mid-scaled machine learning models using supervised and unsupervised
learning.
Software Engineering Intern , Proxbotics Creations Technologies , Jaipur (May,2016 – July,2016)
Developed and optimized various projects including ecommerce, booking & reservation, non-profit
organization Websites, using technologies: HTML, CSS, PHP, JavaScript, MySQL etc.
Trainee at TecheduSoft , Kota (May,2015)
The course contains 15+ modules including Android Basics, fragments, screen designing, intents, various
views, signing app, web servers, web services, notifications, etc.
PROJECTS
All projects are available on git: https://github.com/JAIJANYANI
Video Analysis for surveillance
-A command line app which takes all your CCTV feeds as input and filters feeds with abnormal events
which results in 90% less videos to watch, Used image processing and deep learning algorithms,
outputs all time-stamps of interesting events for all feeds.
Food Calorie Estimator
-An android app to estimate calories present in food with still image. Trained own Data-set (Meal-net)
using Transfer learning Built upon Inception V3, Proposed a Deep Convolutional Neural Network (CNN)
with 48 Layers, Developed a REST API to integrate it in Mobile apps, Optimized total computation time
~ 2 Seconds.
CryptoCurrency Market Predictor
- A Flask app to predict the future prices of various Crypto Currencies, implemented various supervised
and deep learning algorithms such as LSTM (RNN), polynomial regression, using scikit-learn, tensorflow,
keras etc.
Spam Filter
-A REST API to Detect Incoming SMS or Email as Spam or Ham which can be trained on your own data
set. Used NLP with Naive Bayes for Sentiment Analysis.
Image Classifier using CNN
-An application which detects objects present in a still image, implemented convolutional neural
network using open source machine learning library which can be run on multiple machines to reduce
training workloads, classifies objects using pre-trained image-net model.
Online Student and Faculty Portal
-A Web Portal to manage attendance of students and faculties, can be integrated to mobile apps. Uses
Php, MySQL, HTML, CSS, JavaScript, etc.
Tax Accounting
-A Decentralized web app built on Ethereum Block-Chain using Truffle and Embark framework, which
can be used to transfer funds between accounts which automatically deducts tax from the account.
TECHNICAL SKILLS
Programming Languages
Web Technologies
Scripting Languages
Database Management System
Operating Systems
Strongest Areas
COURSES
:
:
:
:
:
C, C++
HTML, CSS
Python, PHP, BASH
MySQL, SQLite
Microsoft Windows, Linux, UNIX
:
Machine Learning, Data Science
Applied Machine Learning , Applied Data Science , Exploratory Data Analysis & Data Visualization , Neural
Networks & Deep Learning , Computer networks , Data Structures & Algorithms , Operating Systems , Cloud
Computing , Data Mining , Block chain Essentials , Database Management Systems.
EDUCATION
University College of Engineering , Kota : Btech (Pursuing) in Computer Science Engineering (2018)
St. Edmunds School , Jaipur : Senior Secondary (XII) Education Rajasthan (2012)
St. Edmunds School , Jaipur : Secondary (X) Education Rajasthan (2010)
from docx import Document

document = Document(r'cv.docx')
exp_summary_flag = False
exp_summary = ''
for p in document.paragraphs:
    if p.text == 'Experience Summary':
        exp_summary_flag = True
    elif p.text == 'Technical Expertise':
        break
    elif exp_summary_flag:
        print(p.text)
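For the plain-text pdfminer output in the question itself, the same flag-based loop can be made robust to the stray blank lines by stopping at the next section heading instead of at the first empty line. A minimal sketch, assuming (as in the converted file shown) that section headings are short, fully upper-case lines:

```python
sample = """\
SUMMARY
To seek an opportunity to apply my technology expertise.
EXPERIENCE
Machine Learning Engineering Intern , Forsk Technologies , Jaipur

Learned the foundational concepts of data science.
PROJECTS
All projects are available on git.
"""

def extract_section(lines, heading='EXPERIENCE'):
    """Collect the lines between `heading` and the next all-caps heading."""
    in_section = False
    collected = []
    for line in lines:
        stripped = line.strip()
        if stripped == heading:
            in_section = True
        elif in_section:
            # Assumption: a short, fully upper-case line marks the next heading.
            if stripped and stripped.isupper() and len(stripped.split()) <= 4:
                break
            collected.append(line)  # blank lines are kept, not treated as the end
    return ''.join(collected).strip()

experience = extract_section(sample.splitlines(keepends=True))
print(experience)
```

Breaking on the next heading rather than on a blank line means blank lines inside the EXPERIENCE section are simply kept as part of its text.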
It seems you want to extract data from a CV. This is a complex problem; a complete answer would be too long for this format, but here are some hints that might help you.
First of all, transform the PDF into JSON or XML rather than plain text: those formats preserve more information, such as the position on the page of a word, paragraph or sequence of words, its font, and so on. Use that information to extract the data you want. For example, a heading like "EXPERIENCE" will most likely use a special font, different from the one used for body paragraphs, so you can use the font name/size to detect headings. Sometimes headings have a special background color too, which you can exploit as well.
You can also use the most common font (that is, the font with the most occurrences in the text) together with position on the page to segment paragraphs. Note that every word/sequence of words in the JSON or XML has attributes (x, y, height, width) that can be used to detect line spacing, tabulation, text columns, etc.
Hope this helps.
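To make the font idea concrete: with pdfminer.six you can walk the layout via extract_pages() and read each LTChar's fontname and size. The sketch below runs the grouping step on plain (character, size) pairs so the logic stays visible; the 14-point heading vs 10-point body sizes are made-up values for illustration, and in real use the pairs would come from the LTChar objects:

```python
def split_by_font_size(chars, heading_size=14):
    """Group a stream of (character, font_size) pairs into a
    {heading: body_text} dict, using font size to spot headings."""
    sections = {}
    heading = None
    heading_buf, body_buf = [], []
    for ch, size in chars:
        if size >= heading_size:
            if not heading_buf and heading is not None:
                # A new heading starts: flush the finished section.
                sections[heading] = ''.join(body_buf).strip()
                heading, body_buf = None, []
            heading_buf.append(ch)
        else:
            if heading_buf:
                heading = ''.join(heading_buf).strip()
                heading_buf = []
            body_buf.append(ch)
    if heading is not None:
        sections[heading] = ''.join(body_buf).strip()
    return sections

def as_chars(text, size):
    """Fake a character stream at a fixed font size, for demonstration."""
    return [(c, size) for c in text]

chars = (as_chars('SUMMARY', 14) + as_chars('To seek an opportunity.\n', 10) +
         as_chars('EXPERIENCE', 14) + as_chars('Interned at Forsk.\n', 10))
print(split_by_font_size(chars)['EXPERIENCE'])
```

In a real script you would build the pair stream by iterating extract_pages(pdf_path) down to the LTChar instances and collecting (ch.get_text(), ch.size) for each one.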
I tried the awk method below and it worked fine:
j=$(awk '{print NF}' filename)   # field count (assumes the file is a single line)
for ((i=1; i<=j; i++)); do
    awk -v i="$i" '$i ~ /EventCorrelationId/ || $i ~ /CreationTime/ || $i ~ /SubscriberNumber/ {print $i}' filename
done
Output:
EventCorrelationId="615-493|-1899671563||1550927718000"
CreationTime="20190225094504"
SubscriberNumber=9270507336
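If the goal is only to pull those three key=value pairs out, the per-field loop can be collapsed into a single grep -o call (a sketch, assuming GNU grep and that the values contain no spaces, as in the output above; /tmp/record.txt is a stand-in for your filename):

```shell
# A sample record in the same shape as the file the awk loop reads
printf '%s\n' 'Foo=1 EventCorrelationId="615-493|-1899671563||1550927718000" CreationTime="20190225094504" SubscriberNumber=9270507336' > /tmp/record.txt

# -o prints each match on its own line; -E enables the alternation
grep -oE '(EventCorrelationId|CreationTime|SubscriberNumber)=[^ ]+' /tmp/record.txt
```

This avoids re-reading the file once per field.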