Jun 20, 2024

OpenAI Vision

It has been almost a decade since I worked with image processing technology. I took a graduate level class in 2D/3D Image Processing. It involved a complex level of Calculus that I shudder to remember, but also some real application code with OpenCV. This was a time before I knew anything about Git, so I had to go find my old college laptop (and hope it would turn on) to find the code as a reference. Luckily, it turned on, but if I bumped the power cable…black screen. After getting my GitHub SSH key setup, and finding out that the existing key on the machine that I had originally tried to use was not the right format, I had the code up in GitHub. Now I could reference it from my primary computer, and boy was it a time warp.

A quick aside. While I was looking at some of those old files and code, I also stumbled upon some other code I’d written. Specifically, a dungeon crawler game in JavaScript and HTML for a game design class I had taken. It is probably the closest thing to anything I did in college to what I do now. I’m not writing games, but it was my first “web development” project technically. I’m mostly using Node.js, Python, TypeScript, React, and other higher level frameworks today, but that isn’t the point. What is cool about seeing some of that old code is realizing what I didn’t know then that I know now about how complex software systems work. I didn’t really understand some of how a browser works. Or how JavaScript files worked when imported into HTML script tags and then using them downstream as dependencies. I just figured out a way to make it work. But it didn’t matter. I didn’t need to understand everything. Like I mentioned in my previous post about becoming a beginner again, there is something I miss about that beauty of naivete. The ability to “just do it” without any regard for how the code would look, if I was using the right design patterns, or if I was using the right framework/tooling. I just built a game in JavaScript and added features I thought were cool (admittedly I was a good student, so the thought fear of not getting a good grade was likely the primary motivating factor that pushed me to just get it done). The rest didn’t matter. Anyway, back to image processing.

What I had built back in college was a Magic: The Gathering (MTG) card detector. There were two components to it, a card database and a video feed. The idea was that we would load the card database and use the SURF algorithm via OpenCV to detect the “interest points” of the images. Then, we would take frames from the live video feed, run it through the same algorithm, and compute the match percentage to each image in the card database. If we found a match, we would show the data from the card database about the card. The code was all in one file, unorganized, and ugly, but it got the job done (again naivete). Also, there is no way it would scale well as the size of the card database grew. My sample dataset had like 5 images in it. No way the simple for loop would work with the tens of thousands of cards in MTG. Though I could think of a few ways to make it much easier with some different database storage options (but that’s not the point of this post). Did I mention this was all in C++? I haven’t written anything in that language since college, but I understand enough about programming now to read it fairly easily. What stood out most to me is the complexity and performance issues. Sure, some of that was because a very junior engineer wrote it, but you just have to know the terminology. SURF, FlannBasedMatcher, Homography, and RANSAC are just a few of the keywords that I saw when looking over it that I had to look up again to remember what they meant. I went “oh yeah” when I read them, but still had no idea what they meant. That is some deep image processing knowledge (mostly math) I haven’t had to think about for a long time. And honestly, it’s getting easier now with AI picking up steam.

With the latest wave of AI advancements, we’ve really made strides in multiple mediums. Text, image, and video have been at the forefront of those innovations. Specifically, OpenAI’s Vision API makes image processing (and video processing if you just process the frames) so easy. You don’t get the fine-grained control of using something like OpenCV directly, but you can gather so much information from an image without needing to know any math or image processing domain knowledge. That is incredibly empowering for engineers, even though it’s not perfect by any means (and even has known limitations). It is a predictive algorithm after all, so your results can be non-deterministic even when processing the same image. On the other hand, instead of coding up a bunch of mathematical equations to get what you want, you can just explain it in English (or your native language). The ability to get started and prove out an idea before diving in really deep with image processing techniques can save you so much time (at the cost of OpenAI credits). And I can only assume that it will get better. I still believe that AI isn’t a stable building block, but this programming interface pales in comparison to what I had to work with back in college. The engineering world sure is changing.

OpenAI Vision

Stay in the Loop