Using the Vision Framework for Object Detection in iOS Apps

What Is the Vision Framework?

The Vision framework is Apple’s high‑level computer vision library, tightly integrated with the iOS ecosystem. It exposes a set of optimized APIs for tasks such as face detection, text recognition, barcode scanning, and object detection. Under the hood, Vision uses Core ML to run neural networks on the CPU, GPU, or Neural Engine, which makes near‑real‑time performance achievable with suitably small models; how well this holds on older devices depends on the model and the hardware.

Vision abstracts away the complexity of pixel buffers and model management. Instead of writing shaders or manually handling image buffers, you create request objects—like VNDetectRectanglesRequest or VNCoreMLRequest—and pass them to an image handler. The framework handles the rest: cropping, scaling, orientation correction, and inference.

A key benefit of Vision is its ability to work both with Apple’s built‑in detectors and with custom Core ML models. For object detection, you can load a model trained on your own dataset (e.g., YOLOv5, MobileNet SSD) and run it through a VNCoreMLRequest. The framework also provides built‑in rectangle, face, and human body detection out of the box.

To learn more about the Vision framework’s capabilities, refer to Apple’s official Vision documentation.

Setting Up Vision in Your Xcode Project

Adding Vision to an iOS app requires only a few steps. First, import the framework in any file that will use Vision:

import Vision

If you are using a custom Core ML model, ensure the model file (.mlmodel or .mlpackage) is added to your Xcode project. Xcode will automatically generate a Swift class for the model. If you plan to run detection on live video, you also need to declare camera access by adding the NSCameraUsageDescription key to Info.plist.

Vision does not require explicit machine learning knowledge. However, for custom models, familiarity with Core ML model conversion tools (like coremltools) is helpful.

Adding a Core ML Model for Custom Object Detection

Apple publishes a gallery of pre‑trained Core ML models that includes object detectors such as YOLOv3. You can also convert models from other frameworks (PyTorch, TensorFlow), for example a MobileNetV2 SSD, using coremltools, or use the Create ML app to train a custom detector if you have a labeled dataset.

Once the model is in your project, you load it into a VNCoreMLModel object:

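// YourCustomModel is the class Xcode generates from the model file; its initializer can throw
guard let coreMLModel = try? YourCustomModel(configuration: MLModelConfiguration()),
      let model = try? VNCoreMLModel(for: coreMLModel.model) else {
    fatalError("Failed to load Core ML model")
}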

This model object is then used to create a VNCoreMLRequest, which will handle the inference and return results as VNRecognizedObjectObservation instances.

Performing Object Detection on Static Images

Detecting objects in a still image is the simplest use case. It involves three steps: create a request, run it with an image handler, and process the results.

Creating a VNCoreMLRequest

With your VNCoreMLModel ready, create the request and attach a completion handler:

let request = VNCoreMLRequest(model: model) { request, error in
    guard let results = request.results as? [VNRecognizedObjectObservation] else {
        return
    }
    // VNCoreMLRequest has no confidence threshold of its own, so filter here
    let confident = results.filter { $0.confidence >= 0.3 }
    // Process the confident results
}
// Optional: control how Vision fits the image to the model's input size
request.imageCropAndScaleOption = .scaleFill

Vision invokes the completion handler as part of perform(_:), typically on the same queue that called it. If you perform requests on a background queue, wrap any UI updates inside the handler in DispatchQueue.main.async.

Using VNImageRequestHandler

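The image handler accepts a CGImage, CIImage, or CVPixelBuffer. For a static image, convert a UIImage to CGImage and pass it. Keep in mind that cgImage does not carry the UIImage's imageOrientation, so if the image may be rotated, use the initializer that takes an orientation parameter and pass the matching CGImagePropertyOrientation: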

guard let cgImage = uiImage.cgImage else { return }
let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
DispatchQueue.global(qos: .userInitiated).async {
    do {
        try handler.perform([request])
    } catch {
        print("Failed to perform request: \(error.localizedDescription)")
    }
}

Notice we dispatch the perform call to a background queue to avoid blocking the UI.

Handling Results: VNRecognizedObjectObservation

Each VNRecognizedObjectObservation contains a bounding box (in normalized coordinates) and a list of possible labels with confidence scores. To extract the top‑scoring label and its bounding box:

for observation in results {
    // Get the highest‑confidence label
    guard let topLabel = observation.labels.first else { continue }
    print("Detected: \(topLabel.identifier) with confidence \(topLabel.confidence)")
    
    // Bounding box in normalized coordinates (origin at bottom‑left)
    let boundingBox = observation.boundingBox
    // Scale to pixel coordinates (still bottom‑left origin);
    // imageWidth and imageHeight are the image's pixel dimensions as Int
    let rect = VNImageRectForNormalizedRect(boundingBox,
                                            imageWidth, imageHeight)
    // Draw rect on the image or overlay
}

Note that Vision uses a bottom‑left origin coordinate system. To display on an iOS view (top‑left origin), you need to flip the Y‑axis. A common utility method looks like:

func convertNormalizedRect(_ box: CGRect, imageSize: CGSize) -> CGRect {
    return CGRect(
        x: box.origin.x * imageSize.width,
        y: (1 - box.origin.y - box.height) * imageSize.height,
        width: box.width * imageSize.width,
        height: box.height * imageSize.height
    )
}

This conversion ensures bounding boxes align correctly with the image as displayed on screen.
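As a quick sanity check of the flip, the sketch below reuses convertNormalizedRect on a made-up box and image size. Both values are illustrative assumptions, not output from a real detection.

```swift
import Foundation

// Same conversion as above: scale to pixels and flip the Y-axis.
func convertNormalizedRect(_ box: CGRect, imageSize: CGSize) -> CGRect {
    return CGRect(
        x: box.origin.x * imageSize.width,
        y: (1 - box.origin.y - box.height) * imageSize.height,
        width: box.width * imageSize.width,
        height: box.height * imageSize.height
    )
}

// Hypothetical detection: lower-left corner at (0.25, 0.5),
// half the image wide and a quarter of the image tall.
let sampleBox = CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25)
let sampleSize = CGSize(width: 400, height: 200)

let converted = convertNormalizedRect(sampleBox, imageSize: sampleSize)
// The box's top edge sits at 0.75 in Vision's space, i.e. 25% down from the
// top of the image, so the converted rect starts at y = 50.
print(converted)
```

A box that sits in the upper half of the image in Vision's coordinates ends up with a small y value in UIKit's coordinates, which is the expected effect of the flip.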

Real‑Time Object Detection with AVFoundation

For live camera feeds, you combine Vision with AVFoundation. The camera delegate delivers video frames as CMSampleBuffer objects; you extract a pixel buffer and pass it to Vision. Performance becomes critical here: you must balance frame rate with detection accuracy.

Setting Up the Camera

Configure an AVCaptureSession with a high‑resolution preset (.hd1920x1080) and an AVCaptureVideoDataOutput. Set the sample buffer delegate on a serial dispatch queue to avoid thread contention:

let session = AVCaptureSession()
session.sessionPreset = .hd1920x1080

guard let device = AVCaptureDevice.default(for: .video),
      let input = try? AVCaptureDeviceInput(device: device) else { return }
session.addInput(input)

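let output = AVCaptureVideoDataOutput()
// videoDataOutputQueue is a serial queue you create yourself,
// e.g. DispatchQueue(label: "videoDataOutputQueue")
output.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
session.addOutput(output)
// startRunning() blocks until the session is up, so call it off the main thread
DispatchQueue.global(qos: .userInitiated).async { session.startRunning() }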

Processing Video Frames

In the delegate method captureOutput(_:didOutput:from:), convert the CMSampleBuffer to a CVPixelBuffer and create a VNImageRequestHandler. Use a flag to throttle requests—running detection on every frame can overwhelm the device:

func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    
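    // Throttle: skip this frame if the previous request is still running
    guard !isProcessing else { return }
    isProcessing = true
    // perform(_:) is synchronous, so the flag can be cleared when this scope exits
    defer { isProcessing = false }
    
    // .up is correct only if the buffer is already upright; unrotated
    // back‑camera frames in portrait usually need .right instead
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        orientation: .up,
                                        options: [:])
    do {
        try handler.perform([objectDetectionRequest])
    } catch {
        print("Detection error: \(error)")
    }
}

Because perform(_:) is synchronous, the request's completion handler has already run by the time the call returns, so isProcessing can be reset at that point. If you move the Vision work to a different queue, reset the flag at the end of the completion handler instead. Either way, the flag prevents requests from piling up and keeps the frame rate consistent.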
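If the detection work runs on a different queue than the capture delegate, a plain Boolean flag ends up being read and written from two threads. One way to guard against that is a small lock-protected gate. The sketch below is a minimal illustration; FrameGate is a hypothetical helper name, not part of Vision or AVFoundation.

```swift
import Foundation

/// Hypothetical helper: lets at most one frame be processed at a time.
final class FrameGate {
    private let lock = NSLock()
    private var busy = false

    /// Returns true if the caller may start processing,
    /// false if another frame is still in flight.
    func tryEnter() -> Bool {
        lock.lock()
        defer { lock.unlock() }
        if busy { return false }
        busy = true
        return true
    }

    /// Marks the in-flight frame as finished so the next one can enter.
    func leave() {
        lock.lock()
        busy = false
        lock.unlock()
    }
}

let gate = FrameGate()
let first = gate.tryEnter()   // accepted: nothing in flight
let second = gate.tryEnter()  // rejected: first frame not finished
gate.leave()                  // first frame done
let third = gate.tryEnter()   // accepted again
```

In the delegate you would call tryEnter() where the isProcessing check is, and leave() wherever the request finishes.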

Performance Considerations

Vision is already heavily optimized, but you can further improve performance:

Reduce input resolution: If you don’t need full‑HD detection, set the session preset to .hd1280x720 or even .vga640x480. Smaller images mean less computation.
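Use the Neural Engine: Core ML decides at runtime whether a model runs on the CPU, GPU, or ANE (Apple Neural Engine). This is governed by MLModelConfiguration.computeUnits, which defaults to .all, rather than by the model's metadata. Layers the ANE does not support fall back to the GPU or CPU, so profile to confirm where your model actually runs.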
Skip frames rather than batch them: a Vision image request handler processes one image at a time. Instead of batching, consider a pipeline that downsamples frames and runs detection every N frames (e.g., every third frame) while interpolating bounding boxes between detections.
Model quantization: If you trained your own model, use coremltools to quantize weights (e.g., to 16‑bit float) for faster inference with minimal loss of accuracy.
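The frame-skipping idea from the list above can be sketched in a few lines. FrameSampler and interpolate are hypothetical names introduced for illustration; the interpolation is a plain linear blend between the previous and the newest box, which is an assumption about how you might smooth the overlay rather than something Vision provides.

```swift
import Foundation

/// Hypothetical helper: returns true for every `interval`-th frame,
/// starting with the first one.
struct FrameSampler {
    let interval: Int
    private var count = 0

    init(interval: Int) { self.interval = max(1, interval) }

    mutating func shouldDetect() -> Bool {
        defer { count += 1 }
        return count % interval == 0
    }
}

/// Linear interpolation between two boxes: t = 0 gives `a`, t = 1 gives `b`.
func interpolate(_ a: CGRect, _ b: CGRect, t: CGFloat) -> CGRect {
    func lerp(_ x: CGFloat, _ y: CGFloat) -> CGFloat { x + (y - x) * t }
    return CGRect(x: lerp(a.origin.x, b.origin.x),
                  y: lerp(a.origin.y, b.origin.y),
                  width: lerp(a.size.width, b.size.width),
                  height: lerp(a.size.height, b.size.height))
}

// Run detection on frames 0 and 3 out of six incoming frames.
var sampler = FrameSampler(interval: 3)
let decisions = (0..<6).map { _ in sampler.shouldDetect() }

// Halfway between an old and a new (normalized) bounding box.
let halfway = interpolate(CGRect(x: 0, y: 0, width: 0.25, height: 0.25),
                          CGRect(x: 0.5, y: 0.5, width: 0.5, height: 0.5),
                          t: 0.5)
```

Since the next detection is not known in advance, the overlay in practice animates from the previous box toward the newest one over the skipped frames, which adds a small visual lag in exchange for fewer inference calls.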

Handling Detection Results and Visualizing

Drawing detection overlays on the live camera preview is a common requirement. A straightforward approach is to add a CALayer overlay on top of the AVCaptureVideoPreviewLayer and replace its sublayers as detections arrive.

Drawing Bounding Boxes

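In the completion handler, convert each observation’s normalized bounding box to view coordinates, then create a CALayer with a colored border and label. The direct scaling used here assumes the preview shows the entire frame without cropping or letterboxing: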

DispatchQueue.main.async {
    // Remove previous overlays
    self.overlayLayer.sublayers?.forEach { $0.removeFromSuperlayer() }
    
    let viewSize = self.previewView.bounds.size
    for observation in results {
        let box = observation.boundingBox
        // Convert to view coordinates (top‑left origin)
        let rect = CGRect(
            x: box.minX * viewSize.width,
            y: (1 - box.maxY) * viewSize.height,
            width: box.width * viewSize.width,
            height: box.height * viewSize.height
        )
        let layer = CALayer()
        layer.frame = rect
        layer.borderColor = UIColor.green.cgColor
        layer.borderWidth = 2.0
        
        // Optional: add a text layer for the label
        if let topLabel = observation.labels.first {
            let textLayer = CATextLayer()
            // Show the confidence of the label being displayed
            textLayer.string = "\(topLabel.identifier): \(Int(topLabel.confidence * 100))%"
            textLayer.fontSize = 14
            textLayer.foregroundColor = UIColor.white.cgColor
            textLayer.frame = CGRect(x: rect.minX, y: rect.minY - 20,
                                     width: rect.width, height: 20)
            self.overlayLayer.addSublayer(textLayer)
        }
        self.overlayLayer.addSublayer(layer)
    }
}

Make sure you update the overlay on the main thread; the Vision completion handler can run on a background queue.
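If the preview layer uses an aspect-fill video gravity, the frame is scaled up and cropped, and multiplying by the view size alone will misplace the boxes. The sketch below shows the geometry under that assumption. viewRect(forNormalizedBox:imageSize:viewSize:) is a hypothetical helper, and it assumes the frame is centered in the view and that imageSize is given in the orientation in which the frame is displayed.

```swift
import Foundation

/// Maps a normalized Vision box (bottom-left origin) into view coordinates
/// (top-left origin) for a frame that is scaled to fill the view and
/// cropped around its center.
func viewRect(forNormalizedBox box: CGRect,
              imageSize: CGSize,
              viewSize: CGSize) -> CGRect {
    // Aspect fill uses the larger scale factor so the frame covers the view.
    let scale = max(viewSize.width / imageSize.width,
                    viewSize.height / imageSize.height)
    let scaledWidth = imageSize.width * scale
    let scaledHeight = imageSize.height * scale
    // The overflow is cropped equally on both sides, so offsets are <= 0.
    let offsetX = (viewSize.width - scaledWidth) / 2
    let offsetY = (viewSize.height - scaledHeight) / 2
    // Flip the Y-axis: distance of the box's top edge from the top of the frame.
    let top = 1 - (box.origin.y + box.size.height)
    return CGRect(x: box.origin.x * scaledWidth + offsetX,
                  y: top * scaledHeight + offsetY,
                  width: box.size.width * scaledWidth,
                  height: box.size.height * scaledHeight)
}

// Illustrative numbers: a 100x200 frame shown in a 100x100 view
// loses 50 points at the top and at the bottom.
let mapped = viewRect(forNormalizedBox: CGRect(x: 0.25, y: 0.25, width: 0.5, height: 0.5),
                      imageSize: CGSize(width: 100, height: 200),
                      viewSize: CGSize(width: 100, height: 100))
print(mapped)
```

When the frame and the view share the same aspect ratio, the offsets are zero and this reduces to the simple scaling used in the drawing code.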

Confidence Thresholds

Different use cases require different confidence levels. A retail product scanner might need a threshold of 0.7+ to avoid false positives, while a fun AR filter could accept 0.2. VNCoreMLRequest does not expose a confidence threshold of its own, so filter the returned observations by their confidence in the completion handler. Filtering there also lets you apply a different threshold per class or per observation.
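Post-filtering with per-class thresholds can be sketched as follows. Detection is a simplified, hypothetical stand-in for the label and confidence you would read from each observation, and the threshold values are examples only.

```swift
import Foundation

/// Hypothetical, simplified stand-in for an observation's top label.
struct Detection {
    let label: String
    let confidence: Float
}

/// Keeps detections whose confidence meets the threshold for their label,
/// falling back to `defaultThreshold` for labels without an explicit entry.
func filterDetections(_ detections: [Detection],
                      thresholds: [String: Float],
                      defaultThreshold: Float) -> [Detection] {
    return detections.filter { detection in
        detection.confidence >= (thresholds[detection.label] ?? defaultThreshold)
    }
}

let raw = [
    Detection(label: "bottle", confidence: 0.82),
    Detection(label: "bottle", confidence: 0.55),
    Detection(label: "dog", confidence: 0.35),
    Detection(label: "cat", confidence: 0.10),
]

// Be strict about bottles, lenient about everything else.
let kept = filterDetections(raw, thresholds: ["bottle": 0.7], defaultThreshold: 0.2)
print(kept.map { "\($0.label) \($0.confidence)" })
```

Keeping the thresholds in a dictionary makes it easy to tune them per class after testing on real footage, without retraining the model.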

Best Practices for Production Apps

Deploying an object detection feature to millions of devices requires careful attention to edge cases and device variability.

Image Preprocessing

Vision automatically crops and scales images to the model’s input size, but you can improve consistency:

Pass the correct orientation (a CGImagePropertyOrientation value) through the orientation parameter of VNImageRequestHandler, especially on iPhone, where camera frames are often rotated relative to the interface.
Normalize lighting by applying a simple histogram equalization before passing the image, though this may slow performance.
For real‑time use, avoid saving frames to disk; work directly with pixel buffers.

Model Optimization

If you train your own model, keep the following in mind:

Use models designed for mobile: lightweight backbones like MobileNetV2, SqueezeNet, or EfficientNet‑Lite.
Limit the number of object classes to what is strictly necessary. Extra classes add output and post‑processing work and tend to make a small model harder to train well.
Test on the oldest device you intend to support. iPhone 7 and earlier may struggle with large models. Use Xcode’s Energy Impact gauge and Instruments to identify bottlenecks.
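For a rough first number before reaching for Instruments, you can time a block of work directly. The helper below is a hypothetical sketch: averageMilliseconds is not a system API, and the closure here is a stand-in for whatever call you want to measure on a device.

```swift
import Foundation

/// Hypothetical helper: runs `work` `iterations` times and returns
/// the mean wall-clock time per run in milliseconds.
func averageMilliseconds(iterations: Int, _ work: () -> Void) -> Double {
    precondition(iterations > 0, "iterations must be positive")
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { work() }
    let elapsed = DispatchTime.now().uptimeNanoseconds - start
    return Double(elapsed) / 1_000_000 / Double(iterations)
}

// Stand-in workload; in an app the closure would perform a Vision
// request on a representative sample frame.
var runs = 0
let mean = averageMilliseconds(iterations: 10) { runs += 1 }
print("Mean latency: \(mean) ms over \(runs) runs")
```

Averages like this hide warm-up cost and thermal throttling, so treat them as a coarse comparison between models or devices rather than a replacement for profiling.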

Power Management

Object detection is a power‑intensive task. To avoid draining the battery:

Stop the detection pipeline when the app goes into the background (listen for UIApplication.willResignActiveNotification).
Reduce frame rate when no objects are detected (e.g., run detection every 500 ms instead of every frame).
On devices with a Neural Engine, leave MLModelConfiguration.computeUnits at .all so Core ML can route inference to the ANE, which is typically more power‑efficient than the GPU.
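The idea of slowing down while nothing is in view can be sketched as a small scheduler. DetectionScheduler is a hypothetical type, the 0.5 second default mirrors the 500 ms example above, and the timestamps are assumed to come from something like each frame's presentation time.

```swift
import Foundation

/// Hypothetical scheduler: detect on every frame while objects are visible,
/// and at most once per `idleInterval` seconds after an empty result.
struct DetectionScheduler {
    let idleInterval: TimeInterval
    private var lastRun: TimeInterval? = nil
    private var lastWasEmpty = false

    init(idleInterval: TimeInterval = 0.5) { self.idleInterval = idleInterval }

    /// `now` is a timestamp in seconds.
    func shouldRun(at now: TimeInterval) -> Bool {
        // Always run if nothing has run yet or the last run found objects.
        guard let last = lastRun, lastWasEmpty else { return true }
        return now - last >= idleInterval
    }

    /// Call after each detection pass with whether anything was found.
    mutating func recordRun(at now: TimeInterval, foundObjects: Bool) {
        lastRun = now
        lastWasEmpty = !foundObjects
    }
}

var scheduler = DetectionScheduler()
let runFirst = scheduler.shouldRun(at: 0.0)       // nothing has run yet
scheduler.recordRun(at: 0.0, foundObjects: false)
let runTooSoon = scheduler.shouldRun(at: 0.25)    // idle, interval not elapsed
let runAfterIdle = scheduler.shouldRun(at: 0.5)   // idle interval elapsed
scheduler.recordRun(at: 0.5, foundObjects: true)
let runWhileActive = scheduler.shouldRun(at: 0.53) // objects in view: every frame
```

The trade-off is a delay of up to one idle interval before a newly appearing object is picked up, so choose the interval based on how quickly the app needs to react.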

Advanced Techniques

Once you have basic detection working, you can enhance your app with additional Vision features and integration with other Apple frameworks.

Combining Vision with ARKit

ARKit can provide scene‑understanding information (feature points, plane detection, image anchors). By combining Vision object detection with ARKit, you can attach virtual objects to real‑world items. For example, detect a dog with Vision, then place a 3D model on top of it using ARKit’s ARAnchor. The boundingBox from Vision can be used to estimate the 2D screen position, which ARKit can then project into 3D space using raycasting.

Sequence Requests for Tracking

Vision supports VNSequenceRequestHandler for tracking objects across multiple frames. This is especially useful when you want to maintain a stable ID for each detected object (e.g., tracking a user’s hand movement). After the first detection, you can create a VNTrackObjectRequest and pass it to a VNSequenceRequestHandler in subsequent frames, feeding each frame's resulting observation into the next request. If the object is occluded or leaves the frame, the tracking confidence drops, and you should fall back to a fresh detection.

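// For the first frame
let originalRequest = VNCoreMLRequest(...)
try handler.perform([originalRequest])
guard var trackedObject = originalRequest.results?.first as? VNDetectedObjectObservation else { return }

// For subsequent frames
let trackingRequest = VNTrackObjectRequest(detectedObjectObservation: trackedObject)
try sequenceHandler.perform([trackingRequest], on: pixelBuffer)
// Carry the updated observation over to the next frame's tracking request
if let updated = trackingRequest.results?.first as? VNDetectedObjectObservation {
    trackedObject = updated
}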

Note that object tracking needs an initial bounding box, so the detector must be a model that outputs bounding boxes (like SSD). Tracking works best with objects that move slowly relative to the camera.

Conclusion

The Vision framework enables iOS developers to add powerful object detection capabilities with minimal boilerplate. By understanding the lifecycle of a Vision request (creating it, performing it on an image or video buffer, and interpreting the results), you can build apps that recognize everything from products in a store to wildlife on a nature trail. Combining Vision with Core ML models gives you the flexibility to detect custom objects, while integration with AVFoundation makes real‑time detection possible. Always profile your app on a range of devices and remember to throttle requests to balance accuracy, speed, and battery life. With these tools, you can create engaging, intelligent iOS applications that truly see the world.