In April 2021, Microsoft Research hosted a two-day workshop introducing their Platform for Situated Intelligence (PSI). The workshop was recorded and the videos are available on YouTube, also embedded here with notes.
Situated intelligence and context-aware computing are central to my research interests. I am keen to see analytical and predictive models that incorporate time and context – that are situated rather than generalised – and that can embrace, or at least represent and visualise, human perception and reasoning when applied to real-world situations. I am also interested in how we can better represent and incorporate uncertainty and ambiguity in such models, and reveal the cultural biases and messiness that are prevalent in human decisions and behaviours. However, I have growing reservations about some of the developments, as fascinating as the possibilities are. Scientists and engineers are pushing forward to develop machine intelligence on a par with how humans perceive the world. Whilst I can comprehend the benefits of delegating narrow tasks to AI that outperform us, I’m not sure what the end goal is for building an AI that closely mimics human consciousness… But that is perhaps a philosophical debate for another post.
The following includes soundbites taken directly from the videos (times in brackets indicate the approximate point in the video the content relates to) with occasional thoughts and comments (usually in brackets). Note this is not a verbatim transcript. For specific quotes to re-use, watch the video! At the very least, I recommend viewing the opening remarks. For the more technically curious, the overview of the platform is also well worth a view.
Opening remarks (13:24)
Looking at new kinds of interactive real-time AI systems. Particularly interested in fluid collaboration between AI and people. Opportunities to progress include advancing human-AI complementarity, raising the fluidity with which AI systems coordinate with people, and achieving mutual grounding – giving AI the ability to develop shared understandings with people about the task at hand and the overall situation in which people and machines are jointly immersed. Achieving these goals will require systems to perceive and make sense of streams of information across the multiple modalities that we as humans depend on to understand the world, that we depend on when we solve problems and when we work with one another.
In a previous effort – the Situated Interaction project – the team sought to endow systems with the ability to see, listen and speak, and to leverage multiple sensory feeds to understand critical aspects of language, gestures, and the surrounding physical environment. Learned a lot from many multi-year system building efforts, including: the receptionist (white female avatar), the assistant (white female avatar), the directions robot (small toy-like robot with arms), and the third-generation elevator projects (n/a). All relied on integrative AI solutions. Learned a lot about the need to support multiple and continuous loops of perception, reasoning and action, including the system’s own actions and their effects on the world. Learned a great deal about the hard engineering challenges of building, debugging and extending multi-modal integrative AI systems. Learnings included the difficulties of grappling with the widely different timings and time constants of signals and processing across different modalities like vision and speech, and the different delays introduced by the components employed to analyse and fuse inferences. (17:00) Timing issues are amplified by the stacking of inferences into pipelines of analyses, and compounded in that completion times can be stochastic and non-deterministic. Another pain point – common debugging techniques such as tracing and breakpoints have not provided efficient means of gaining insight into failures in large-scale end-to-end integrative AI systems.
Motivated by experiences with building these systems and the goal of enabling faster-paced progress on multi-modal systems and applications for human-AI collaboration, the team decided to pause to do a deep-dive on the creation of this SI framework. Effort has focused on developing a distributed execution infrastructure that supports asynchronous processes of perception and inference. The team defined core data structures that make it efficient to represent and reason about time, as well as the key constructs of space and uncertainty. That is, time, space and uncertainty are first-class objects in the system. The team also put special emphasis on new kinds of debugging capabilities, including rich visualisations. And set up what is now a growing ecosystem of components. On another front, and one at the foundations of integrative AI, PSI’s time-centric runtime provides opportunities for doing new kinds of systems-level learning and optimisation. Harnessing ML for meta-reasoning and control of integrative AI systems. For example, to employ deep reinforcement learning to dynamically guide trade-offs on the efficiency versus accuracy of the analysis of streams of perceptual data. PSI provides access to handles for data collection about system operation. This kind of meta-level learning and decision making can be a pathway not only to the optimisation of specific applications, but a window into research on principles of integrative intelligence, for exploring approaches to the orchestration of perception and reasoning. (Is this the ‘airplanes don’t need feathers to fly’ breakthrough for artificial consciousness…?) The most recent work has been to introduce higher-level data structures and programming abstractions, adding constructs to the system for grounding, understanding attention, and guiding conversation.
Reflecting on the promise and necessity of integrative approaches to AI moving forward. There has been great excitement about the power of deep neural networks. The jumps over the last decade in vision, natural language and speech recognition have been awesome and surprising. We are now seeing advances in building multi-modal neural models that leverage joint vision and language datasets. These developments will play an important role in advancing human-AI collaborative abilities. However, I don’t believe the challenges with rich, fluid multi-modal collaboration between humans and AI systems will be handled magically and in a solitary way by one or a few deep neural models. For the foreseeable future, we will need to do significant work on the coordination of multiple AI competencies and components to engineer systems that can perceive and make sense of events, people and surroundings, and take sequences of actions in coordination with the actions, needs and understandings of people. Making progress will depend on advances in both scientific and engineering capabilities. (Assume ‘scientific’ includes the social sciences…?)
There is a chasm between where we are today in human-AI collaboration and the advances in AI for performing single vertical tasks like object recognition and speech recognition. One source of this gap, we believe, is that we have not had good tools and methods for building and debugging integrative AI systems. Rich integrations of multiple competencies are going to be required for these solutions. We hope SI can help close this gap…
Platform for Situated Intelligence Overview (22:48)
The Platform for Situated Intelligence (\psi, written PSI here) is an open source framework that simplifies development and research in building multimodal, integrative-AI systems, i.e. systems that leverage different types of streaming data including audio, video and depth information, and that integrate and coordinate multiple AI technologies to process this data in real-time. Prototypical example – a robot using an array of sensors to interact with its surroundings and with multiple people to give directions inside a building. But the scope and reach of the platform is much broader – any time you are dealing with streaming temporal data or need to bring together multiple kinds of technologies, especially in situations where you are latency constrained or where acting in real-time matters, the affordances PSI provides can accelerate development.
Building end-to-end systems for perceptive AI remains a daunting task. Existing tools lack essential primitives and are not well-tuned to this type of work. PSI as a platform was developed to address these challenges and simplify development for this type of system.
Challenges: many stem from the multimodal and integrative nature of perceptive systems. These systems work with different types of streaming data: audio, video, depth, LIDAR, pose-tracking results, speech recognition results etc. These data generally stream at different bandwidths and with different latencies. Typically processed in complex pipelines that bring together many kinds of components built out of heterogeneous technologies: sensor-specific devices, cloud services, neural nets running on GPUs, application-specific code, interaction with databases etc. The systems tend to be compute-intensive and often need to operate under latency constraints, so performance considerations are important. Often you want these systems to behave well under load and under latency; properties such as graceful degradation of performance are often a requirement. Even if you leave aside the challenges that stem from the sheer complexity of these systems, their heterogeneous components and their stringent performance requirements, a number of other challenges arise from the fact that the programming languages we use today are missing an important set of primitives and are not well attuned to these sorts of applications. Notions such as time, space and uncertainty are core to a lot of these systems, yet they are not first-order objects in any of our programming languages today. As a result, a lot of time ends up being spent dealing with and debugging low-level infrastructure issues.
e.g. Time: these systems need to be fast, but being fast is not enough. Coordination requires being latency-aware. Often, these systems need to perform data fusion, where a component needs to integrate information coming from different pathways. To do this correctly and in a reproducible manner, the information arriving on these different streams needs to be paired according to when those events happened in the world, not according to when they reached this component.
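PSI itself is a .NET framework, so its real synchronisation operators are C#; as a rough illustration of the idea of pairing messages by originating time rather than arrival time, here is a toy Python sketch (all names here – `Message`, `FusionBuffer` – are invented for illustration, not PSI API):

```python
import bisect
from dataclasses import dataclass, field

@dataclass(order=True)
class Message:
    originating_time: float          # when the event happened in the world
    payload: object = field(compare=False)

class FusionBuffer:
    """Pairs messages from two streams by nearest originating time,
    within a tolerance window - regardless of arrival order or latency."""
    def __init__(self, tolerance=0.05):
        self.tolerance = tolerance
        self.secondary = []          # buffered messages from the other stream

    def push_secondary(self, msg):
        bisect.insort(self.secondary, msg)   # keep sorted by originating time

    def pair(self, primary):
        # find the buffered message closest in originating time
        if not self.secondary:
            return None
        times = [m.originating_time for m in self.secondary]
        i = min(range(len(times)),
                key=lambda k: abs(times[k] - primary.originating_time))
        best = self.secondary[i]
        if abs(best.originating_time - primary.originating_time) <= self.tolerance:
            return (primary, best)
        return None
```

The key point is that pairing is driven by the world-time stamps carried on the messages, so the result is reproducible even when stream latencies vary between runs.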
These applications also have a specific set of needs when it comes to debugging, data visualisation and analytics that are not sufficiently addressed by the development tools we have today. Debugging complex, concurrent applications with breakpoints and print statements is not really scalable or effective. The ability to inspect and visualise the data streams as they flow through the application, and perform data analytics, can speed up the development cycle.
The main goal of the PSI framework is to lower engineering costs and foster more innovation and research in this space. It is available on Github – github.com/Microsoft/psi – and has cross-platform support. It is built on .NET Standard. The framework has three parts: Runtime, Tools, and Components (see image below).
PSI applications are basically pipelines or graphs of connected components that talk to each other via streams of data. See image below. Code on the left-hand side demonstrates a fictitious example on the right, combining two types of sensed data – microphone and camera. Say you want to detect who out of many people in an image is talking.
Once you call the run method on the pipeline, the PSI runtime takes charge of executing the pipeline, schedules the components for execution and starts flowing messages on these temporal streams of data.
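The pipeline-of-components idea can be sketched in a few lines. This is not the PSI API (which is C#); it is a toy Python analogue of components connected by push-based streams, with every name invented for illustration:

```python
class Stream:
    """A stream that pushes each posted message to all subscribers."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, fn):
        self.subscribers.append(fn)

    def post(self, msg):
        for fn in self.subscribers:
            fn(msg)

class Doubler:
    """A trivial 'component': one input stream in, one output stream out."""
    def __init__(self, source):
        self.out = Stream()
        source.subscribe(self.receive)

    def receive(self, msg):
        self.out.post(msg * 2)

# wiring: source -> Doubler -> sink
src = Stream()
doubler = Doubler(src)
results = []
doubler.out.subscribe(results.append)
src.post(3)   # flows through the graph; results becomes [6]
```

In PSI the runtime does this wiring and scheduling for you, in parallel and with timing metadata attached to every message; the sketch only shows the shape of the graph.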
The streams infrastructure in PSI has a number of important properties. First, streams are strongly typed, based on .NET generics.
Time is a first-order primitive construct. Each message is timestamped not only with a creation time – when that component generated it – but also with what is referred to as an originating time – the time at the source, when the message first enters the pipeline, e.g. when the camera first sees an image, or when a microphone picks up a snippet of audio. This originating time is automatically carried forward as the message goes downstream (including any latency in reaching the next step in the pipeline – e.g. in the image below, both sources trigger at the same time (12:00, not shown on the slide) but travel through the pipeline with different latencies at each step, also captured and retained).
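As a minimal sketch of the two-timestamp idea (again a hypothetical Python analogue, not PSI's actual envelope type): each transformation stamps a fresh creation time but carries the originating time forward unchanged.

```python
from dataclasses import dataclass

@dataclass
class Envelope:
    payload: object
    originating_time: float   # set once, at the source; never changes downstream
    creation_time: float      # set by whichever component produced this message

def make_source_message(payload, now):
    # at the source, the two timestamps coincide
    return Envelope(payload, originating_time=now, creation_time=now)

def transform(msg, fn, now):
    # downstream component: new payload, new creation time,
    # originating time carried forward
    return Envelope(fn(msg.payload), msg.originating_time, now)
```

Because every message retains its originating time, any component can compute its own latency as `creation_time - originating_time`.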
This latency-awareness throughout the entire graph enables a number of important scenarios. One is synchronisation and data fusion. Using the example above, when the messages reach the Speech Source Detector component, it knows that these two messages arrived with specific latencies. It can buffer messages and pair (coordinate) them according to their originating times. PSI provides a large number of primitives that support this kind of synchronisation, data fusion, interpolation and all kinds of operations, eliminating the need for the developer to think about exactly how this needs to be done.
Data can also be easily logged, providing persistence. Can just apply a write operator on any of these streams and send the data to disk. The framework automatically generates custom serialisation code. You can log most .NET data types.
Because timing information is persisted together with the data, we can enable reproducible replay. Can eliminate the sensors in an application, resurface the streams from a previously persisted store and re-run your app in a reproducible fashion to tune the application and improve downstream components to get the results that you want. This sort of reproducible experimentation is an important accelerator in this space.
Runtime also has access to latency information, creates opportunities to improve scheduling and maintain steady states… Also allows the developers to control how the application behaves under load, e.g. decide which messages it is OK to drop or not… more info on the Github site.
To write your own components for the runtime, e.g. for the Speech Source Detector used in this example, you just have to write a custom class and implement two receiver methods, one for each incoming stream. Then from inside the receivers you can post messages to send downstream.
The PSI runtime provides state protection. You can define your own state variable, e.g. time since a person last spoke (in the image above). Can touch this variable from both receivers; can write or read it without having to lock it or worry about concurrency issues. When the PSI runtime schedules the various components in the pipeline for execution, it tries to maximise pipeline parallelism, so it will execute receivers of different components in parallel on multiple threads, but will always ensure that receivers of a given component are executed exclusively with respect to each other. This exclusivity is enforced automatically by the runtime, so you don’t have to worry about concurrency.
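One way to picture the exclusivity guarantee is a single per-component gate that every receiver passes through. This is a hypothetical Python analogue (PSI enforces this in its scheduler, not with an explicit user-visible lock):

```python
import threading

class SpeechSourceDetector:
    """Toy component: receivers of the same component never run
    concurrently, so shared state needs no per-variable locking.
    Here a single per-component lock stands in for the runtime's
    automatic exclusivity guarantee."""
    def __init__(self):
        self._gate = threading.Lock()
        self.last_spoke = None   # shared state, touched from both receivers

    def receive_audio(self, originating_time):
        with self._gate:
            self.last_spoke = originating_time

    def receive_video(self, frame):
        with self._gate:
            # safe to read shared state; audio receiver cannot run right now
            return self.last_spoke
```

From the component author's point of view the lock is invisible: both receivers read and write `last_spoke` as if the component were single-threaded.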
State protection links to another important property – isolated execution. Whenever one of the receivers receives a message, it can modify and edit it internally however it wishes, because if this message is being sent to a number of receivers of other components, the PSI runtime automatically ensures that every receiver gets its own copy. There is an automatic cloning mechanism implemented that insulates the component writer from the intricacies of a concurrent execution environment.
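The cloning idea can be shown in a few lines. This hypothetical Python sketch (not PSI's actual mechanism, which uses generated cloners rather than generic deep copy) hands each subscriber its own copy, so in-place edits by one receiver never leak to siblings:

```python
import copy

class CloningStream:
    """Delivers a private deep copy of each message to every subscriber."""
    def __init__(self):
        self.receivers = []

    def subscribe(self, fn):
        self.receivers.append(fn)

    def post(self, msg):
        # each receiver gets its own copy; mutation by one receiver
        # is invisible to the others
        for fn in self.receivers:
            fn(copy.deepcopy(msg))
```

Generic `deepcopy` is slow; part of what PSI's serialisation subsystem does is generate efficient type-specific cloning code so the isolation guarantee doesn't cost a fortune.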
You can also create hierarchical compositions to help with code encapsulation and keep things simple. It also means you can get fine-grained parallelism within each encapsulated element of the system (and presumably conflicts can be resolved if clones are changed separately by different receivers and those changes need reconciling for some downstream action in the pipeline).
Focused a lot on data visualisation. PSI Studio provided to enable visualisation of data collected by PSI applications. Also provided timeline visualisers, latency visualisers, 2D and 3D visualisers of the streaming data, e.g. LIDAR point clouds, camera images…
Studio enables you to navigate the data and also to annotate the behaviour of the application. Can be done offline with persisted data but also can be viewed with live data streams. Framework has support for batch processing to handle large data sets. Can run as multiple sessions to accumulate and then batch process. Future plans include developing fully interactive GUI.
Components currently available on Github are predominantly for multi-modal sensing and processing, sensors such as cameras and microphones, processing audio and image data, speech and language processing, and for running ML/Onnx models, and wrapping cloud services such as Azure Cognitive Services. Aiming to help start the ecosystem and encourage others to contribute.
Q&A session (50:00)
Framework runs on multiple systems, have even run it on a Raspberry Pi…
Example in the data visualisation – based on an Azure Kinect camera. Is in the Samples repo, including the entire example app.
Using a bridging framework for interoperating with legacy monolithic systems. Basically, anywhere that supports the .NET Core framework, PSI can run (question as to whether or not it will run on OSX… in theory/PoC, yes).
Rest of workshop is focused on hands-on tutorials and technical details:
57:14 – Coding tutorial
1:45:02 – Data visualisation and annotation with the platform studio
2:43:42 – Building an open source community and panel discussion
5:05 – Debugging and diagnostics
50:03 – Interoperating with Python
1:21:51 – Interoperating with ROS
1:56:19 – Interoperating with HoloLens 2
2:23:02 – Interoperating with MS Teams
2:47:51 – Situated Interaction Foundation (a preview – toolkit for situated interactive applications)
3:30:29 – Closing remarks (very brief)