{"id":639,"date":"2025-03-24T05:10:35","date_gmt":"2025-03-24T05:10:35","guid":{"rendered":"https:\/\/blog.ngocha.biz\/?p=639"},"modified":"2025-03-24T05:10:35","modified_gmt":"2025-03-24T05:10:35","slug":"debugging-with-strace-in-linux","status":"publish","type":"post","link":"https:\/\/blog.ngocha.biz\/?p=639","title":{"rendered":"Debugging With Strace in Linux"},"content":{"rendered":"<p>In this blog, we\u2019ll explore:<\/p>\n<ul>\n<li>Strace, a powerful Linux debugging tool.<\/li>\n<li>A practical debugging example using strace.<\/li>\n<li>Why strace is not recommended in production.<\/li>\n<\/ul>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\u26a0\ufe0f<\/div>\n<div class=\"kg-callout-text\">If you\u2019ve never used strace, don\u2019t worry, you\u2019ll probably meet it in an interview when you least expect it. \ud83d\ude00<\/div>\n<\/div>\n<p>Troubleshooting is an important part of every DevOps engineer&#8217;s or SRE&#8217;s job.<\/p>\n<p>It could involve:<\/p>\n<ul>\n<li>Database performance bottlenecks<\/li>\n<li>Web server performance issues<\/li>\n<li>Identifying slow requests<\/li>\n<li>Excessive resource consumption<\/li>\n<\/ul>\n<p>and more\u2026<\/p>\n<p>The&nbsp;<strong>strace<\/strong>&nbsp;utility is useful for debugging various application issues by tracing system calls (syscalls).<\/p>\n<blockquote><p>This blog is for&nbsp;<strong>educational purposes only.<\/strong>&nbsp;Do not use strace in production without understanding its effects on performance.<\/p><\/blockquote>\n<h2 id=\"scenario\">Scenario<\/h2>\n<p>Let\u2019s say the Nginx app wants to&nbsp;<strong>write a log entry<\/strong>&nbsp;to a file on the disk. Since Nginx cannot directly access hardware like disks, it must use syscalls to request kernel services.<\/p>\n<p>For example, Nginx might use the following syscalls:<\/p>\n<ul>\n<li><code>open()<\/code>&nbsp;\u2013 Opens log files once when Nginx starts &#8211; (not per write)<\/li>\n<li><code>write()<\/code>&nbsp;\u2013 Writes the log entries to the file in buffered batches.<\/li>\n<li><code>close()<\/code>&nbsp;\u2013 Closes the file descriptors during log rotation\/shutdown<\/li>\n<\/ul>\n<p>That means Nginx is programmed to use syscalls whenever it needs to interact with the operating system for tasks like file operations, network handling, or process management.<\/p>\n<p>If you need to diagnose an issue related to an application&#8217;s performance, networking, or other system interactions, tracing syscalls can help you identify the problem.<\/p>\n<h2 id=\"what-is-strace\">What is strace?<\/h2>\n<p>Strace is a linux command-line utility that intercepts and records all the system calls a program (eg: nginx) makes and any signals it receives.<\/p>\n<p>Lets look at a simple example.<\/p>\n<p>Lets look at a summary of system calls made when running the&nbsp;<code>ls<\/code>&nbsp;command.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/image-335.png\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"1292\" height=\"1268\" srcset=\"https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w600\/2025\/03\/image-335.png 600w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/size\/w1000\/2025\/03\/image-335.png 1000w, https:\/\/storage.ghost.io\/c\/5f\/2f\/5f2f4d20-2abf-4534-8d40-7aa233aedd43\/content\/images\/2025\/03\/image-335.png 1292w\" sizes=\"auto, (min-width: 720px) 720px\"><\/figure>\n<p>Overall, the command made 78 system calls with 7 errors total.<\/p>\n<p>Also, the&nbsp;<strong>execution time was extremely fast<\/strong>&nbsp;(all individual calls show 0.000000 seconds)<\/p>\n<h2 id=\"practical-debugging-example\">Practical Debugging Example<\/h2>\n<p>To look at a practical debugging example using strace, I had to simulate a real-world scenario of a slow-responding page.<\/p>\n<p>To simulate this scenario, here\u2019s what I did:<\/p>\n<ul>\n<li>Deployed Nginx with PHP-FPM.<\/li>\n<li>Created a simple PHP page (<code>\/slow<\/code>) that responds only after 2 seconds. This is just to simulate a real-world scenario of a slow response page.<\/li>\n<\/ul>\n<p>The next step is to capture the syscalls made by Apache when accessing the&nbsp;<code>http:\/\/localhost\/slow<\/code>&nbsp;page.<\/p>\n<h3 id=\"tracing-nginx-syscalls\">Tracing nginx syscalls<\/h3>\n<p>To trace the syscalls, we first need to get the Nginx worker process ID.<\/p>\n<pre><code>ps aux | grep nginx<\/code><\/pre>\n<p>Now, we need to attach strace to the process ID to trace system calls.<\/p>\n<p>Since we are investigating a slow-responding page, our&nbsp;<strong>focus should be on the syscalls involved in handling HTTP<\/strong>&nbsp;requests and responses.<\/p>\n<div class=\"kg-card kg-callout-card kg-callout-card-blue\">\n<div class=\"kg-callout-emoji\">\ud83d\udca1<\/div>\n<div class=\"kg-callout-text\"><b><strong style=\"white-space: pre-wrap;\">recvfrom<\/strong><\/b>&nbsp;syscall receives data from a network socket (e.g., an HTTP request from a client).<\/div>\n<\/div>\n<p>I am going to trace only the&nbsp;<code>recvfrom()<\/code>&nbsp;syscall using the following command. Here,&nbsp;<code>13173<\/code>&nbsp;is the Nginx worker process ID.<\/p>\n<pre><code>sudo strace -e trace=recvfrom \\\n        -p 13173 -f -T -tt -o strace.log<\/code><\/pre>\n<p>Now that strace is tracing the calls, I generated 10 requests with 2 concurrent users to capture the logs in&nbsp;<code>strace.log<\/code>&nbsp;using apache benchmark utility.<\/p>\n<pre><code>ab -n 10 -c 2 http:\/\/localhost\/slow<\/code><\/pre>\n<p>Now that enough requests have been generated, the goal is to debug the slow response issue.<\/p>\n<p>If we check the&nbsp;<code>strace.log<\/code>, we will find complex log entries containing&nbsp;<code>recvfrom<\/code>&nbsp;syscalls, similar to the following.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/media.beehiiv.com\/cdn-cgi\/image\/fit=scale-down,format=auto,onerror=redirect,quality=80\/uploads\/asset\/file\/172d6fb4-a720-413d-8278-adc47a711678\/image.png?t=1738319845\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"764\" height=\"507\"><\/figure>\n<h2 id=\"analysis-of-recvfrom-log-for-slow-requests\">Analysis of<strong>&nbsp;<\/strong><code>recvfrom()<\/code><strong>&nbsp;<\/strong>Log for Slow Requests<\/h2>\n<p>Lets look at the first two entries.<\/p>\n<pre><code>13173 10:23:41.655745 recvfrom(5, \"GET \/slow HTTP\/1.0\\r\\nHost: localh\"..., 1024, 0, NULL, NULL) = 81 &lt;0.000011&gt;\n\n13173 10:23:41.656706 recvfrom(12, \"GET \/slow.php HTTP\/1.0\\r\\nHost: lo\"..., 1024, 0, NULL, NULL) = 104 &lt;0.000005&gt;<\/code><\/pre>\n<p>Here,<\/p>\n<ol>\n<li>Client requests&nbsp;<code>\/slow<\/code>.<\/li>\n<li>Nginx forwards it to&nbsp;<code>\/slow.php<\/code>&nbsp;for processing.<\/li>\n<li>Time: 10:23:41.656706.<\/li>\n<\/ol>\n<p>The next log entry is the response from php.<\/p>\n<pre><code>13173 10:23:43.659573 recvfrom(13, \"\\1\\6\\0\\1\\0,\\4\\0Content-type: text\/html;\"..., 4096, 0, NULL, NULL) = 72 &lt;0.000041&gt;<\/code><\/pre>\n<p>Here,<\/p>\n<ol>\n<li>Response received from PHP after 2 seconds (10:23:43.659573).<\/li>\n<li><strong>Delay: 2.003 seconds.<\/strong><\/li>\n<li>This suggests slow PHP execution (Which we simulated)<\/li>\n<\/ol>\n<p><strong>Note:<\/strong>&nbsp;This is just an example of identifying slowness in an HTTP request. Depending on the issue, the relevant syscalls and the approach to analysis may vary.<\/p>\n<h2 id=\"strace-performance\">Strace &amp; Performance<\/h2>\n<p>Strace is not advised to be used in production environments.<\/p>\n<p>One of the problems with strace is that it slows down applications significantly, especially in production.<\/p>\n<p>It uses&nbsp;<code>ptrace()<\/code>, which pauses the target process twice per system call (entry and exit). It could even lead to application crashes.<\/p>\n<p>This frequent pausing and context-switching adds high overhead to applications.<\/p>\n<p>You can read more about it from&nbsp;<a href=\"https:\/\/www.brendangregg.com\/blog\/2014-05-11\/strace-wow-much-syscall.html?utm_source=newsletter.devopscube.com&#038;utm_medium=referral&#038;utm_campaign=debugging-with-strace-in-linux\" rel=\"noopener noreferrer nofollow\"><em>Brendan Gregg&#8217;s Blog<\/em><\/a><\/p>\n<p>But eBPF based tools lilke bpftrace have minimal overhead.<\/p>\n<figure class=\"kg-card kg-image-card\"><img decoding=\"async\" src=\"https:\/\/media.beehiiv.com\/cdn-cgi\/image\/fit=scale-down,format=auto,onerror=redirect,quality=80\/uploads\/asset\/file\/839842e0-fc34-4a4c-991d-c4dc0c12e3f6\/image.png?t=1738995410\" class=\"kg-image\" alt=\"\" loading=\"lazy\" width=\"697\" height=\"395\"><\/figure>\n<h2 id=\"wrapping-up\">Wrapping Up<\/h2>\n<p>This is just an example to show how to use strace to debug system calls. In real-world scenarios, the syscalls and issues you debug will be different.<\/p>\n<p>Also, strace-related questions are&nbsp;<strong>common in interviews<\/strong>&nbsp;for product companies or roles that require Linux knowledge.<\/p>\n<p>Hope this blog was helpful.<\/p>\n<p>Want to Stay Ahead in DevOps &amp; Cloud? Join Free Newsletter Below.<\/p>\n<p><!--kg-card-begin: html--><br \/>\n<iframe loading=\"lazy\" src=\"https:\/\/embeds.beehiiv.com\/2a495ef4-3de7-4600-8a0d-de5dc968b372\" data-test-id=\"beehiiv-embed\" width=\"100%\" height=\"320\" frameborder=\"0\" scrolling=\"no\" style=\"border-radius: 4px; border: 2px solid #e5e7eb; margin: 0; background-color: transparent;\"><\/iframe><\/p>\n<p><!--kg-card-end: html--><\/p>\n<hr>\n<p><strong>Ngu\u1ed3n:<\/strong> <a href=\"https:\/\/devopscube.com\/debugging-with-strace-in-linux\/\" target=\"_blank\" rel=\"noopener noreferrer\">Debugging With Strace in Linux \u2014 DevOpsCube<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Source: https:\/\/devopscube.com\/debugging-with-strace-in-linux\/<\/p>\n","protected":false},"author":1,"featured_media":640,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-639","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-devops"],"_links":{"self":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts\/639","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=639"}],"version-history":[{"count":0,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/posts\/639\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=\/wp\/v2\/media\/640"}],"wp:attachment":[{"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=639"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=639"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.ngocha.biz\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=639"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}