PyOhio | Presentation: Building a world class document pipeline using Python

Sunday 1:30 p.m.–2:20 p.m.

Building a world class document pipeline using Python

Andrew Wolfe

Audience level: Intermediate

Description

At Brokersavant, we process large quantities of real estate assets, ranging from commercial property flyers to large real estate leases, and our customers expect a lightning-fast turnaround. Learn how we leveraged open source technologies and Python libraries to create a system that scales to millions of assets per day without missing a beat.

Abstract

Processing documents isn't just about loading them using file() and extracting the text right from the document. Bad scans, images, misspellings, foreign languages, hundreds of document and image types, and other obstacles prevent us from taking the easy route to processing the document assets our software systems require. In this talk, we'll dive into some practices I've learned from solving real-world problems: extracting content from documents such as leases, flyers, and real estate comparison sheets from various global corporations and Fortune 100 companies at scale. We will discuss the following topics, which will help take your document processing to the next level:

Creating an asset pipeline using Celery, Redis, Docker, and Amazon S3
Using Elastic and NLTK to extract meaning from documents and make decisions within your pipeline
Protecting against edge cases in your pipeline
Using open source technologies to convert documents into a standard input type
Creating a smarter engine using sklearn
Turning errors into improvements
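
To give a feel for how these topics fit together, here is a minimal sketch of a staged pipeline using only the standard library. It is not the system presented in the talk. The `Asset` class, the stage functions, and the keyword rule are all hypothetical stand-ins: in a real deployment each stage might run as a Celery task, and the keyword rule would likely be replaced by a model trained with sklearn.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A document moving through the pipeline (hypothetical structure)."""
    name: str
    raw: bytes
    text: str = ""
    label: str = ""
    errors: list = field(default_factory=list)


def normalize(asset):
    # Standardize every input into one type: plain text.
    # Undecodable bytes are replaced instead of raising an exception.
    asset.text = asset.raw.decode("utf-8", errors="replace").strip()
    return asset


def guard(asset):
    # Edge case: a bad scan or an image may yield no usable text.
    if not asset.text:
        raise ValueError("no extractable text")
    return asset


def classify(asset):
    # Assumption: a simple keyword rule stands in for a trained model.
    lowered = asset.text.lower()
    asset.label = "lease" if "lease" in lowered else "flyer"
    return asset


STAGES = (normalize, guard, classify)


def run_pipeline(asset):
    # Run the stages in order. On failure, record which stage failed
    # and stop, so the error can be reviewed later instead of being lost.
    for stage in STAGES:
        try:
            asset = stage(asset)
        except Exception as exc:
            asset.errors.append(f"{stage.__name__}: {exc}")
            break
    return asset


results = [
    run_pipeline(Asset("a.txt", b"Commercial LEASE agreement")),
    run_pipeline(Asset("b.txt", b"   ")),
]
for result in results:
    print(result.name, result.label or "(unlabeled)", result.errors)
```

Recording the failing stage on the asset, rather than letting the exception propagate, is one simple way to turn errors into material for later improvements: the failed assets can be collected and inspected to decide which edge case to handle next.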