AI in Vision Accessibility

Explore AI's potential in vision accessibility. This talk showcases current assistive tech, real-world workflows, and promising generative AI applications for the visually impaired, including braille transcription and OCR projects.

NVDA, VoiceOver, VLM, Gemini, OCR, Screen readers (NVDA, VoiceOver)

Overview

This demo surveys the current assistive technology landscape and discusses promising applications of AI in accessibility for people with visual impairments. As a student developer with a severe visual impairment, I’ll walk through real workflows built on the technologies I use for computer use, navigation, and more, from screen readers to vision language models. Based on these, I’ll discuss areas where generative AI could improve accessibility.
In addition, I’ll demo two braille-related projects I worked on: automatic braille-to-print transcription of math material, and OCR on braille documents.

Links

https://github.com/endernoke/braille-ocr
A YOLO-based mobile web app that translates photographed braille into Unicode.
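
The last step of such a pipeline, turning recognized cells into Unicode, can be sketched as follows. This is a minimal illustration and not the repository's actual code: it assumes a detector has already produced, for each braille cell, the set of raised dot numbers (1 to 8), and the names `dots_to_unicode` and `cells_to_unicode` are hypothetical. It relies on the layout of the Unicode Braille Patterns block, where U+2800 is the blank cell and dot n sets bit n-1.

```python
BRAILLE_BASE = 0x2800  # U+2800 is the blank braille pattern


def dots_to_unicode(dots):
    """Convert the raised dot numbers (1-8) of one cell to its Unicode character."""
    code = BRAILLE_BASE
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"dot number out of range: {dot}")
        code |= 1 << (dot - 1)  # dot n corresponds to bit n-1
    return chr(code)


def cells_to_unicode(cells):
    """Convert a sequence of cells (each an iterable of dot numbers) to a string."""
    return "".join(dots_to_unicode(cell) for cell in cells)


# Hypothetical detector output: dots 1 / dots 1,2 / dots 1,4
detected_cells = [{1}, {1, 2}, {1, 4}]
print(cells_to_unicode(detected_cells))
```

The output is a string of braille pattern characters, which a screen reader or a braille translation tool can then process further.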

Tech stack

NVDA

NVDA is a free, open-source screen reader that provides blind and vision-impaired users with full access to the Windows operating system.

Developed by the Australian charity NV Access, NonVisual Desktop Access (NVDA) is an open-source screen reader that enables over 250,000 blind and vision-impaired individuals worldwide to navigate Windows computers. The software translates on-screen information into synthetic speech or braille, supporting major applications like Google Chrome, Microsoft Office, and Mozilla Firefox right out of the box. Because it is funded by donations and community contributions rather than license sales, NVDA spares users the licensing fees of proprietary alternatives (saving them thousands of dollars) and can even run portably from a USB drive, so it can be used on any Windows workstation.

https://www.nvaccess.org

VoiceOver

Apple's gesture-based screen reader built directly into iOS, macOS, watchOS, and tvOS to provide spoken descriptions of on-screen elements.

VoiceOver is Apple's native, gesture-based screen reader that integrates directly with its operating systems (including iOS, macOS, and watchOS) to make devices fully accessible without requiring visual contact. The technology translates on-screen visual elements into spoken descriptions and auditory cues, allowing users to navigate interfaces, select buttons, and read text. By utilizing specialized gestures, keyboard commands, and the customizable Rotor menu, users can easily scan web pages by headings or links, interact with complex layouts, and connect refreshable braille displays for tactile feedback.

https://developer.apple.com/documentation/accessibility/voiceover

VLM

Vision-Language Models (VLMs) integrate computer vision with natural language processing to let machines see, reason, and communicate about visual data in real time.

VLMs represent the next step in multimodal AI, moving beyond simple image tagging to complex reasoning across visual and textual inputs. These systems typically pair a vision encoder (like CLIP or SigLIP) with a large language model backbone (such as Llama 3 or Qwen 2.5) via a specialized projection layer. This architecture allows the model to perform high-stakes tasks: extracting structured JSON from messy invoices, identifying safety hazards in industrial video feeds, or providing zero-shot image classification without specific retraining. By mapping pixels and tokens into a shared embedding space, VLMs transform static imagery into searchable, conversational, and actionable intelligence.
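
The encoder-projector-LLM pairing described above can be illustrated with a toy sketch. It is written in plain Python with random numbers standing in for trained weights and real embeddings, and it assumes a single linear projection (real systems often use a small MLP or another connector). The function names and dimensions are made up for illustration.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def project_patches(patch_embeddings, weight, bias):
    """Map each vision-encoder patch embedding into the LLM's embedding width."""
    return [linear(patch, weight, bias) for patch in patch_embeddings]


def build_llm_input(patch_embeddings, text_embeddings, weight, bias):
    """Prepend projected visual tokens to the text token embeddings."""
    return project_patches(patch_embeddings, weight, bias) + text_embeddings


random.seed(0)
VISION_DIM, LLM_DIM = 4, 6  # toy sizes; real models use hundreds or thousands

# Random stand-ins for a trained projection layer
weight = [[random.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

# Stand-ins for vision encoder output (3 patches) and embedded prompt (2 tokens)
patches = [[random.random() for _ in range(VISION_DIM)] for _ in range(3)]
text = [[random.random() for _ in range(LLM_DIM)] for _ in range(2)]

sequence = build_llm_input(patches, text, weight, bias)
print(len(sequence), len(sequence[0]))  # 5 tokens, each of width 6
```

The point of the sketch is only the shape bookkeeping: after projection, image patches and text tokens share one width, so the language model can attend over them as a single sequence.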

https://huggingface.co/blog/vlms

Gemini

Google's natively multimodal AI model: understands and operates across text, code, audio, image, and video.

Gemini is Google's family of natively multimodal AI models: it understands and combines information across text, code, audio, image, and video inputs. The technology is designed for flexibility, running on everything from data centers to mobile devices. The first generation (Gemini 1.0) shipped in three sizes: Ultra (for highly complex tasks), Pro (for scaling across a broad range of tasks), and Nano (for efficient on-device tasks). Developers access the models via the Gemini API.

https://deepmind.google/technologies/gemini/

OCR

Optical Character Recognition (OCR) is the foundational technology that converts typed, printed, or handwritten text from images (scans, JPEGs, PDFs) into machine-readable, searchable data.

OCR is a data extraction tool: it transforms non-editable text in digital images into structured, machine-readable information. The process involves image analysis, character recognition (using pattern matching or feature extraction), and post-processing for accuracy. Modern systems that use machine learning, including Intelligent Character Recognition (ICR) for handwriting, achieve high accuracy, often exceeding 99% on clean documents. Key applications include automating data entry for high-volume documents (invoices, receipts, bank statements), digitizing historical archives for searchability (e.g., Google Books), and real-time functions like license plate recognition (LPR) in traffic systems. This technology cuts manual data entry time and enables text-based analytics.
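
The pattern-matching variant of character recognition mentioned above can be shown with a toy sketch. It covers only the first two stages (binarization as a stand-in for image analysis, then template matching) and omits post-processing. The 3x3 glyph templates, the threshold, and the function names are all assumptions made for illustration. Production OCR engines use far richer features and learned models.

```python
# Hypothetical 3x3 templates: "1" marks ink, "0" marks background
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def binarize(gray, threshold=128):
    """Turn a grayscale grid (0-255) into ink/background bits; dark pixels are ink."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def hamming(glyph, template):
    """Count the pixels where the binarized glyph disagrees with a template."""
    return sum(
        g != int(t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def recognize(gray):
    """Return the template character with the fewest mismatching pixels."""
    glyph = binarize(gray)
    return min(TEMPLATES, key=lambda ch: hamming(glyph, TEMPLATES[ch]))


# A noisy scan of "L": the bottom-right pixel came out too light
noisy_l = [
    [20, 240, 250],
    [30, 250, 245],
    [10, 40, 200],
]
print(recognize(noisy_l))  # L
```

Nearest-template matching tolerates a small amount of noise, as with the one wrong pixel here, but it degrades quickly with new fonts, skew, or handwriting, which is where feature extraction and learned models come in.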

https://cloud.google.com/vision/docs/ocr

Screen readers (NVDA, VoiceOver)

Essential assistive technologies that translate on-screen text and interface elements into synthetic speech or braille output.

Screen readers are assistive technologies that enable blind and low-vision individuals to navigate digital environments. NonVisual Desktop Access (NVDA) is a free, open-source screen reader built for Microsoft Windows by the Australian charity NV Access, supporting over 55 languages. VoiceOver is Apple's proprietary screen reader integrated directly into macOS, iOS, and iPadOS, offering system-level navigation and gesture-based controls. Both tools rely on the semantic information that applications and browsers expose through platform accessibility APIs (on the web, this is derived from markup such as HTML landmarks and ARIA attributes) to read text, describe images, and announce interactive elements, which is why accessibility testing and compliance are a necessary part of modern software development.
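
The role that landmarks and ARIA attributes play can be illustrated with a small sketch using Python's standard `html.parser` module. It is a rough approximation for illustration only: real screen readers consume an accessibility tree provided by the browser and operating system rather than reading HTML themselves, and the role table, the `AccessibilityOutline` class, and the announcement wording are all hypothetical.

```python
from html.parser import HTMLParser

# Simplified implicit roles; the real HTML-to-ARIA mapping has more cases
# (for example, header and footer are landmarks only in some contexts).
IMPLICIT_ROLES = {
    "nav": "navigation",
    "main": "main",
    "aside": "complementary",
    "button": "button",
}


class AccessibilityOutline(HTMLParser):
    """Collect rough, screen-reader-style announcements from markup."""

    def __init__(self):
        super().__init__()
        self.announcements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            alt = attrs.get("alt")
            if alt is None:
                self.announcements.append("image: unlabeled")
            elif alt:
                self.announcements.append(f"image: {alt}")
            return  # an empty alt marks a decorative image, so it is skipped
        role = attrs.get("role") or IMPLICIT_ROLES.get(tag)
        if role is None:
            return
        label = attrs.get("aria-label")
        self.announcements.append(f"{role}: {label}" if label else role)


def outline(html):
    parser = AccessibilityOutline()
    parser.feed(html)
    return parser.announcements


SAMPLE = """
<nav aria-label="Primary"><a href="/">Home</a></nav>
<main>
  <img src="chart.png" alt="Sales chart">
  <img src="logo.png">
  <button aria-label="Close dialog">x</button>
</main>
"""
for line in outline(SAMPLE):
    print(line)
```

The second image has no `alt` attribute, so the sketch can only report it as unlabeled. Gaps like this are the kind of thing that image descriptions from a vision language model might help fill.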

https://www.nvaccess.org
