The Good, Bad, & Ugly of TensorFlow - Indico Data
Excerpt
**Model checkpointing.** Train a model for a while. Stop to evaluate it. Reload from the checkpoint and keep training.

**Performance and GPU memory usage are similar to Theano and everything else that uses CUDNN.** Most of the performance complaints about earlier releases appear to have been due to the use of CUDNNv2, so TensorFlow v0.8 (which uses CUDNNv4) is much improved in this regard.

…

**Lack of authoritative examples for data ingestion.** The TensorFlow docs and examples focus on using several well-known academic datasets to demonstrate various features or functionality. This makes sense, and is a good thing to prioritize for general consumption. But real-world problems are rarely drop-in replacements for these kinds of datasets. Working with tensor inputs and shapes can be a real stumbling block when learning a new deep learning framework, so an example or two showing how to work with messy input data (weird shapes, padding, distributions, tokenization, etc.) could save a lot of pain for future developers and engineers.

**Documentation can be inconsistent.** There are a number of good tutorials available for TensorFlow, and the code itself is very well commented (thank you, authors). But machine learning/deep learning is a deep and wide domain, and there is a lag between new functionality and the docs/tutorials that explain how to build things with it. A few of our favorite tutorials are: …

Unfortunately, especially for RNNs, there are still conceptual gaps in the documentation and tutorials, such as the gap between the simple or trivial examples and the full-on state-of-the-art examples. This can be a real barrier for developers who are trying to learn the concepts at the same time as they are learning the framework. For example, the Udacity tutorials and the RNN tutorial that uses Penn TreeBank data to build a language model are very illustrative, thanks to their simplicity. They are good illustrations for learning a concept, but too basic for real-world modeling tasks.
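As a concrete instance of the messy-input handling the data-ingestion point asks for, here is a minimal, framework-agnostic sketch of tokenizing raw strings and right-padding them into a fixed-shape batch. All names (`tokenize`, `pad_batch`, the toy vocabulary) are illustrative, not part of any TensorFlow API.

```python
# Hypothetical sketch: turning messy, variable-length text into a
# fixed-shape batch suitable for feeding a deep learning framework.

def tokenize(text, vocab):
    """Map whitespace-separated tokens to integer ids; 0 is <unk>."""
    return [vocab.get(tok, 0) for tok in text.lower().split()]

def pad_batch(sequences, pad_id=0, max_len=None):
    """Right-pad variable-length sequences into a rectangular batch."""
    if max_len is None:
        max_len = max(len(s) for s in sequences)
    return [s[:max_len] + [pad_id] * (max_len - len(s)) for s in sequences]

vocab = {"the": 1, "cat": 2, "sat": 3, "mat": 4, "on": 5}
texts = ["The cat sat", "the cat sat on the mat"]
batch = pad_batch([tokenize(t, vocab) for t in texts])
# batch is now a 2 x 6 rectangle of token ids:
# [[1, 2, 3, 0, 0, 0],
#  [1, 2, 3, 5, 1, 4]]
```

Once the batch is rectangular, feeding it to any framework's tensor input is straightforward; the padding id can later be masked out of the loss.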
…

### The Ugly

**Heterogeneous resource utilization adds complexity.** This is a classic engineering tradeoff between control and simplicity: if you want fine-grained control over how your operations execute (e.g., on which GPU node), then you need to maintain those constraints yourself. In some cases, fine-grained control is necessary to maximize performance. For example, you might use multiple threads to fetch and pre-process a batch of data before feeding the GPU, so the GPU never waits on these operations. For more detail on using asynchronous runners on CPUs to feed GPUs, or to benchmark your own queues, see Luke’s excellent post, TensorFlow Data Input (Part 2): Extensions.

**TensorFlow can hog a GPU.** Similarly, on startup, TensorFlow tries to allocate all available GPU memory for itself. This is a double-edged sword, depending on your context. If you are actively developing a model and have GPUs available on a local machine, you might want to allocate portions of the GPU to different tasks. However, if you are deploying a model to a cloud environment, you want to know that your model can execute on the hardware available to it, without unpredictable interactions from other code that may access the same hardware.

…

### Summary

It takes a fair amount of effort to implement end-to-end workflows in any framework, and TensorFlow is no exception. Some TensorFlow features (queues, certain graph operations, resource allocation/context management, graph visualization) are relatively new to the deep learning scene, and like many, we’re still learning the best ways to exploit them. Others have been available in other frameworks for some time; even though the overall concepts are similar, implementation details can differ. We appreciate all the effort Google’s developers have put into implementing good abstractions (e.g., streaming data from queues).
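The asynchronous fetch-and-preprocess pattern described under "The Ugly" can be sketched with nothing but Python's standard library. This is a toy stand-in for TensorFlow's queue runners, not their API: a background thread loads and pre-processes batches into a bounded queue so the consumer (the GPU, in the real setting) never idles waiting on input.

```python
import queue
import threading

# Toy producer/consumer sketch: a background thread "fetches" and
# "pre-processes" batches so the training loop never waits on input.
# The fetch/pre-process bodies are placeholders, not a real pipeline.

def producer(batch_queue, num_batches):
    for i in range(num_batches):
        batch = [i * 10 + j for j in range(4)]   # "fetch" a raw batch
        processed = [x / 255.0 for x in batch]   # "pre-process" it
        batch_queue.put(processed)
    batch_queue.put(None)  # sentinel: no more data

batch_queue = queue.Queue(maxsize=2)  # bounded: applies backpressure
t = threading.Thread(target=producer, args=(batch_queue, 3))
t.start()

results = []
while True:
    batch = batch_queue.get()
    if batch is None:
        break
    results.append(batch)  # in the real setting: feed the GPU
t.join()
```

The bounded queue is the key design choice: if the consumer stalls, the producer blocks on `put` rather than buffering unboundedly, which is the same backpressure behavior TensorFlow's queues provide.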
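For the GPU-hogging behavior, a minimal configuration sketch, assuming the `tf.Session`-era API: `ConfigProto` and `GPUOptions` with `per_process_gpu_memory_fraction` let you cap the allocation instead of letting TensorFlow grab the whole card (the 0.4 fraction is just an example value).

```python
import tensorflow as tf

# Sketch: capping TensorFlow's default grab-all GPU allocation so
# other processes (or a second model) can share the same card.
config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.4)
)
sess = tf.Session(config=config)

# Later releases also added an allow_growth option that starts small
# and grows the allocation on demand rather than reserving a slice:
config.gpu_options.allow_growth = True
```

Capping the fraction suits the local-development case described above; the default grab-everything behavior remains the safer choice for a dedicated cloud deployment.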
Related Pain Points
Poor Data Ingestion Documentation and Examples
TensorFlow documentation focuses on well-known academic datasets but lacks authoritative examples for real-world data ingestion with messy input data (weird shapes, padding, distributions, tokenization), creating a significant learning barrier for practical applications.
GPU Memory Hogging and Allocation Issues
TensorFlow attempts to allocate all available GPU memory on startup, which can prevent other code from accessing the same hardware and limits flexibility in local development environments where developers want to allocate portions of the GPU to different tasks.
Inconsistent Documentation and Tutorial Gaps
TensorFlow documentation is inconsistent, with lags between new functionality and the docs/tutorials covering it. There are conceptual gaps between simple examples and state-of-the-art examples, particularly for RNNs, creating barriers for developers learning both the concepts and the framework simultaneously.