# Troubleshooting Guide
_Last updated: 2026-01-05_

This guide provides solutions to common issues encountered in this Docker-based infrastructure.
## Issue: Container is restarting or won't start
**Symptoms:**
- `docker ps -a` shows the container's status as `Restarting` or `Exited` (plain `docker ps` hides exited containers).
- `docker-compose up -d` command fails with an error.
**Diagnosis:**
1. **Check the logs:** The first step is always to check the container's logs.
```bash
docker-compose logs -f <service-name>
```
Look for error messages, stack traces, or any indication of what might be wrong.
2. **Check dependencies:** If the container depends on other services (e.g., a database), ensure those services are running and healthy.
```bash
docker-compose ps
```
3. **Check configuration:**
   - **Environment variables:** Ensure all required environment variables are set correctly in the `.env` file or `docker-compose.yml`.
   - **Volumes:** Verify that all volume paths are correct and that the files and directories on the host have the correct permissions. The user running the Docker container (often specified with `PUID` and `PGID`) needs to have read and write access to the volume paths.
   - **Ports:** Check for port conflicts. If another service on the host is using the same port, the container will fail to start. Use `sudo lsof -i -P -n | grep LISTEN` to check for listening ports.
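
To narrow the `lsof` output from the port check down to a single port, a small filter can help. This is a minimal sketch: `listeners_on_port` is a hypothetical helper name, and port `8080` in the usage comment is only an example.
```bash
# listeners_on_port PORT: read `lsof -i -P -n` output on stdin and print
# only the LISTEN rows whose local address ends in :PORT.
listeners_on_port() {
  awk -v port="$1" '
    /\(LISTEN\)/ {
      # The address is the second-to-last field, e.g. "*:8080" or "[::1]:8080";
      # the port is whatever follows the last colon.
      n = split($(NF-1), parts, ":")
      if (parts[n] == port) print
    }'
}

# Usage (8080 is an assumed example port):
#   sudo lsof -i -P -n | listeners_on_port 8080
```
No output means nothing on the host is listening on that port, so a conflict is unlikely to be the cause.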
**Resolution:**
- Once the root cause is identified from the logs or configuration check, address the issue. This may involve:
  - Correcting an environment variable.
  - Fixing file permissions on a volume.
  - Changing a port mapping.
  - Restarting a dependency.
- After applying the fix, try starting the container again:
```bash
docker-compose up -d --force-recreate <service-name>
```
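
After recreating the container, it can be worth confirming that it stays up rather than slipping back into a restart loop. The sketch below assumes the standard `docker inspect` template fields `.State.Status`, `.RestartCount`, and `.State.ExitCode`; `container_summary` is a hypothetical helper name.
```bash
# container_summary NAME: print the container's state, how many times Docker
# has restarted it, and the exit code of its last run.
# NAME is the container name or ID, which may differ from the compose service name.
container_summary() {
  docker inspect --format \
    'status={{.State.Status}} restarts={{.RestartCount}} exit_code={{.State.ExitCode}}' \
    "$1"
}

# Usage:
#   container_summary <container_name>
```
Running it twice, about a minute apart, shows whether `restarts` is still climbing.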
## Issue: 502 Bad Gateway from Traefik
**Symptoms:**
- Accessing a service through its domain (e.g., `https://books.3ddbrewery.com`) results in a "502 Bad Gateway" error from Traefik.
**Diagnosis:**
1. **Check the Traefik dashboard:** The Traefik dashboard (if accessible) lists every router, service, and middleware along with its status. Look for any errors related to the service in question.
2. **Check Traefik's logs:**
```bash
docker logs traefik
```
Look for errors related to the service, such as "no servers found".
3. **Check the service's logs:**
```bash
docker-compose logs -f <service-name>
```
The service itself might be crashing or unhealthy.
4. **Check network connectivity:**
- Ensure the service is connected to the `traefik_proxy` network in its `docker-compose.yml`.
- From the Traefik container, try to ping the service's container by name:
```bash
# Run ping inside the Traefik container (3 packets, then exit)
docker exec -it traefik ping -c 3 <container_name>
```
5. **Check Traefik labels:**
- Ensure the `traefik.http.services.<service-name>.loadbalancer.server.port` label in the `docker-compose.yml` file is set to the port the application listens on inside the container, not the port published on the host.
- Verify that all Traefik labels are correctly formatted.
**Resolution:**
- **Service not on `traefik_proxy` network:** Add the service to the `traefik_proxy` network in its `docker-compose.yml`.
- **Incorrect port:** Correct the port in the `traefik.http.services.<service-name>.loadbalancer.server.port` label.
- **Service not running:** Troubleshoot the service using the "Container is restarting" guide above.
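
The network check can also be done from the host without opening `docker-compose.yml`. This is a sketch that reads the attached networks from `docker inspect`; `on_network` is a hypothetical helper name.
```bash
# on_network CONTAINER NETWORK: succeed (exit 0) if CONTAINER is attached
# to NETWORK, fail otherwise.
on_network() {
  # Print one attached network name per line, then look for an exact match.
  docker inspect --format \
    '{{range $name, $cfg := .NetworkSettings.Networks}}{{println $name}}{{end}}' \
    "$1" | grep -qx -- "$2"
}

# Usage:
#   on_network <container_name> traefik_proxy && echo "attached" || echo "missing"
```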
## Issue: 404 Not Found from Traefik
**Symptoms:**
- Accessing a service through its domain results in a "404 Not Found" error.
**Diagnosis:**
1. **Check the Traefik dashboard:** Verify that a router has been created for the domain you are trying to access.
2. **Check the `rule` label:** Ensure the `traefik.http.routers.<service-name>.rule` label is set to the correct `Host(...)`.
3. **Check DNS:** Make sure your DNS is correctly pointing the domain to the IP address of the Traefik server.
**Resolution:**
- **Incorrect rule:** Correct the `Host(...)` rule in the `docker-compose.yml` file.
- **DNS issue:** Correct the DNS record for the domain.
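
To tell a DNS problem apart from a routing-rule problem, one option is to send the request straight to the Traefik host while keeping the original hostname. The sketch below uses curl's `--resolve` option; `http_status_via` is a hypothetical helper name, and it assumes Traefik serves HTTPS on port 443.
```bash
# http_status_via HOST IP: request https://HOST/ but connect to IP directly,
# bypassing DNS. Prints only the HTTP status code.
http_status_via() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
    --resolve "$1:443:$2" "https://$1/"
}

# Usage:
#   http_status_via <domain> <traefik-server-ip>
```
A `404` here points to the `Host(...)` rule, since DNS was not involved. A normal response suggests the DNS record is what needs fixing.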
## Issue: Authentication Failures
**Symptoms:**
- Being unable to log in to a service that is protected by Authelia.
- Seeing "Unauthorized" or "Forbidden" errors.
**Diagnosis:**
1. **Check Authelia's logs:**
```bash
docker logs authelia
```
Look for any errors related to the authentication attempt.
2. **Check the application's logs:** The application might be rejecting the authentication for some reason.
```bash
docker-compose logs -f <service-name>
```
In the case of `books_webv2`, check the backend logs for any errors related to the `Remote-User` header.
3. **Check the Traefik middleware:** Ensure the `traefik.http.routers.<service-name>.middlewares` label (note the plural) is correctly set to `authelia-brewery` or `authelia-fails`.
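
To see the label value Docker actually applied to the running container, it can be read back with `docker inspect`. This is a minimal sketch; `label_value` is a hypothetical helper name. Note that the key Traefik reads for router middleware is the plural `middlewares`.
```bash
# label_value CONTAINER LABEL: print the value of a single label on a
# running container (prints an empty line if the label is not set).
label_value() {
  docker inspect --format "{{index .Config.Labels \"$2\"}}" "$1"
}

# Usage:
#   label_value <container_name> traefik.http.routers.<service-name>.middlewares
```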
**Resolution:**
- **Restart Authelia:** Sometimes, simply restarting Authelia can resolve issues.
```bash
docker restart authelia
```
- **Check user credentials:** Double-check the username and password.
- **Check Authelia configuration:** Review Authelia's `configuration.yml` for any errors.
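
When Authelia's log is long, limiting it to recent warnings and errors makes a failed attempt easier to spot. This is a sketch, assuming the container is named `authelia` as in the commands above; `recent_authelia_errors` is a hypothetical helper name, and the pattern is a plain text match rather than a parser for any particular log format.
```bash
# recent_authelia_errors [WINDOW]: show log lines from the last WINDOW
# (default 10m) that mention "error" or "warn", case-insensitively.
recent_authelia_errors() {
  docker logs --since "${1:-10m}" authelia 2>&1 | grep -iE 'error|warn'
}

# Usage: reproduce the failed login, then run
#   recent_authelia_errors 5m
```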
## Issue: MariaDB/MySQL Replication Stopped
**⚠️ CURRENT STATUS**: As of January 2026, `node` database replication has been **intentionally disabled**. All applications connect directly to the primary server (`192.168.1.251`). This section is retained for reference if replication is re-enabled in the future.
**Symptoms:**
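- Secondary database server shows `Replica_IO_Running` or `Replica_SQL_Running` as `No` (MariaDB and older MySQL versions name these fields `Slave_IO_Running` and `Slave_SQL_Running`).
- `Seconds_Behind_Source` (`Seconds_Behind_Master` on MariaDB and older MySQL) is `NULL` or stays well above `0`.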
- Applications using the secondary database have stale data.
**Diagnosis:**
1. **Check replication status on secondary server:** Connect to the secondary database server using phpMyAdmin or MySQL client and run:
```sql
SHOW REPLICA STATUS\G
```
Or for older versions (MySQL before 8.0.22, MariaDB before 10.5):
```sql
SHOW SLAVE STATUS\G
```
2. **Check key fields:**
   - `Replica_IO_Running` (`Slave_IO_Running` on MariaDB and older MySQL): Should be `Yes`
   - `Replica_SQL_Running` (`Slave_SQL_Running`): Should be `Yes`
   - `Seconds_Behind_Source` (`Seconds_Behind_Master`): Should be `0`
   - `Last_Error`: Should be empty. If it is not, the message indicates what went wrong.
3. **Check primary server status:**
```sql
SHOW MASTER STATUS;
```
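Note the `File` and `Position` values. On MySQL 8.4 and later, `SHOW MASTER STATUS` has been removed; use `SHOW BINARY LOG STATUS` instead.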
4. **Check binary log settings:** Ensure binary logging is enabled on the primary server:
```sql
SHOW VARIABLES LIKE 'log_bin';
```
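
The key-field check from step 2 can be scripted so it is easy to repeat. This is a minimal sketch: `replication_ok` is a hypothetical helper name, and it accepts both the `Replica_*`/`Seconds_Behind_Source` and the `Slave_*`/`Seconds_Behind_Master` field names, since which set appears depends on the server.
```bash
# replication_ok: read `SHOW REPLICA STATUS\G` (or `SHOW SLAVE STATUS\G`)
# output on stdin; exit 0 only if both threads report Yes and the lag is 0.
replication_ok() {
  awk -F': *' '
    { gsub(/^ +/, "", $1) }                          # strip the \G left padding
    $1 ~ /^(Replica|Slave)_IO_Running$/     { io  = $2 }
    $1 ~ /^(Replica|Slave)_SQL_Running$/    { sql = $2 }
    $1 ~ /^Seconds_Behind_(Source|Master)$/ { lag = $2 }
    END { exit !(io == "Yes" && sql == "Yes" && lag == "0") }'
}

# Usage (host and user are placeholders):
#   mysql -h <secondary-host> -u <user> -p -e 'SHOW REPLICA STATUS\G' \
#     | replication_ok && echo "replication healthy" || echo "replication needs attention"
```
A missing field, `NULL` lag, or any non-zero lag counts as unhealthy, so the check errs on the side of flagging a problem.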
**Resolution:**
**Common Fix - Restart Replication:**
```sql
-- On secondary server
STOP REPLICA;
START REPLICA;
SHOW REPLICA STATUS\G
```
**If there's a specific error:**
- **Skip one transaction (if error is known to be safe):**
```sql
STOP REPLICA;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START REPLICA;
```
**⚠️ Warning:** Only use this if you understand the error and know it's safe to skip.
**If replication is completely broken:**
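- **Re-establish replication from current position:** This skips every change made on the primary between the point where the replica stopped and the primary's current position, so re-sync the secondary's data (for example, from a fresh dump of the primary) before relying on it.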
1. Get current position from primary:
```sql
-- On primary
SHOW MASTER STATUS;
```
2. Reset and reconfigure replica:
```sql
-- On secondary
STOP REPLICA;
CHANGE MASTER TO
  MASTER_LOG_FILE='<file from primary>',
  MASTER_LOG_POS=<position from primary>;
START REPLICA;
SHOW REPLICA STATUS\G
```
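
Copying the file name and position by hand is easy to get wrong. The sketch below turns the primary's status output into the matching statement for review before it is run; `change_master_sql` is a hypothetical helper name, and it assumes the status was fetched with the `\G` terminator so each field is on its own line.
```bash
# change_master_sql: read `SHOW MASTER STATUS\G` output on stdin and print
# the corresponding CHANGE MASTER TO statement. Exits 1 if either value is missing.
change_master_sql() {
  awk -F': *' '
    { gsub(/^ +/, "", $1) }              # strip the \G left padding
    $1 == "File"     { file = $2 }
    $1 == "Position" { pos  = $2 }
    END {
      if (file == "" || pos == "") exit 1
      # \047 is a single quote
      printf "CHANGE MASTER TO MASTER_LOG_FILE=\047%s\047, MASTER_LOG_POS=%s;\n", file, pos
    }'
}

# Usage (host and user are placeholders):
#   mysql -h 192.168.1.251 -u <user> -p -e 'SHOW MASTER STATUS\G' | change_master_sql
```
The function only prints the statement. Running it on the secondary, between `STOP REPLICA;` and `START REPLICA;`, remains a manual step.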
**Prevention:**
- Monitor replication status regularly
- Ensure both servers have sufficient disk space
- Check network connectivity between primary and secondary servers
- Review MariaDB error logs: `/var/log/mysql/error.log`