Channel: Data Science, Analytics and Big Data discussions - Latest topics

How to read PDF, Excel and Word documents using Python


@harshalpatil.nmu wrote:

I am working on indexing and searching documents. I have saved all the .docx, .pdf, .xls and .ppt files in a folder named datasets. I want to extract the text from all of these documents for indexing, and to clean it using basic NLTK tasks. I tried textract for this, but it does not work. Could you help me find a solution?

So far I only read the document names from the directory using the os module, as below:

import os

root = r"D:\Harshal\search"
path = os.path.join(root, "datasets")

# Write one numbered line per file found under the datasets folder
with open("filenames1.txt", "w", encoding="utf-8") as f:
    i = 0
    for dirpath, subdirs, files in os.walk(path):
        for name in files:
            i = i + 1
            f.write(str(i) + "," + name + "\n")
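
One possible direction, given only as a rough sketch and not a tested solution for this dataset: group the files by extension, then hand each file to a reader for its format. The names group_by_extension, extract_text and EXTRACTORS are made up for illustration. The two readers assume the third-party packages python-docx and pypdf are installed; they are imported lazily so the rest of the code runs without them. Readers for .xls and .ppt are left out here, so those files simply return None.

```python
import os
from collections import defaultdict


def read_docx(path):
    # Assumption: the third-party python-docx package is installed.
    import docx
    return "\n".join(p.text for p in docx.Document(path).paragraphs)


def read_pdf(path):
    # Assumption: the third-party pypdf package is installed.
    from pypdf import PdfReader
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)


# Hypothetical registry mapping a file extension to its reader function.
EXTRACTORS = {".docx": read_docx, ".pdf": read_pdf}


def group_by_extension(folder):
    """Walk folder recursively and return {lowercased extension: [full paths]}."""
    groups = defaultdict(list)
    for dirpath, _subdirs, files in os.walk(folder):
        for name in files:
            ext = os.path.splitext(name)[1].lower()
            groups[ext].append(os.path.join(dirpath, name))
    return dict(groups)


def extract_text(path):
    """Return the text of one file, or None if no reader is registered for it."""
    ext = os.path.splitext(path)[1].lower()
    extractor = EXTRACTORS.get(ext)
    return extractor(path) if extractor else None
```

Calling group_by_extension on the datasets folder shows which formats are actually present, and extract_text can then be applied to each path; the returned string is what would go on to the NLTK cleaning step.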

Posts: 1

Participants: 1
