Parsing human regions into semantic parts, e.g., body, head and arms etc., from a random natural image is challenging while fundamental for computer vision and widely applicable in industry. One major difficulty to handle such a problem is the high flexibility of scale and location of a human instance and its corresponding parts, making the parsing task either lack of boundary details or suffer from local confusions. To tackle such problems, in this work, we propose the "Auto-Zoom Net" (AZN) for human part parsing, which is a unified fully convolutional neural network structure that: (1) parses each human instance into detailed parts. (2) predicts the locations and scales of human instances and their corresponding parts. In our unified network, the two tasks are mutually beneficial. The score maps obtained for parsing help estimate the locations and scales for human instances and their parts. With the predicted locations and scales, our model "zooms" the region into a right scale to further refine the parsing. In practice, we perform the two tasks iteratively so that detailed human parts are gradually recovered. We conduct extensive experiments over the challenging PASCAL-Person-Part segmentation, and show our approach significantly outperforms the state-of-art parsing techniques especially for instances and parts at small scale.